Evaluating equation (3.4) with \(N_{1,2}\), \(\bar{x}_{1,2}\) and \(s_{1,2}\) gives a t-value of \(t_{1,2}=-2.740403\). The following R snippet shows how the two samples were generated and how the t-value was calculated with the user-defined function t.val() (note that the two samples truly differ by \(\Delta\mu=5\) with "true" uncertainty \(\sigma=5\)).
rm(list=ls()) # delete all objects from the workspace
set.seed(1234) # set the random seed for reproducible results
t.val <- function(pop1,pop2){ # t-value for two independent samples
  x1 <- mean(pop1,na.rm=T) # sample mean of group 1
  x2 <- mean(pop2,na.rm=T) # sample mean of group 2
  n1 <- length(pop1[!is.na(pop1)]) # sample size of group 1
  n2 <- length(pop2[!is.na(pop2)]) # sample size of group 2
  s2.1 <- var(pop1,na.rm=T) # sample variance of group 1
  s2.2 <- var(pop2,na.rm=T) # sample variance of group 2
  s.pool <- sqrt( ( (n1-1)*s2.1 + (n2-1)*s2.2 ) / (n1+n2-2)) # pooled standard deviation
  t <- (x1-x2)/(s.pool*sqrt( 1/n1 + 1/n2 ) ) # t-value, equation (3.4)
  return(t)
}
old <- rnorm(10,135,5) # "old" sample: n=10 draws from N(mean=135, sd=5)
new <- rnorm(10,140,5) # "new" sample: n=10 draws from N(mean=140, sd=5)
cat("Old(N=10):",round( mean(old),2),"+/-", round(sd(old),2)) #producing output old, with mean and standard deviation are decimal number.## Old(N=10): 133.08 +/- 4.98
Summary: the sample old (N = 10) has a mean of 133.08 and a standard deviation of about 4.98.
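The analogous call for the new sample, presumably what produced the output below, would be:
cat("New(N=10):",round(mean(new),2),"+/-", round(sd(new),2)) # print mean and sd of "new" (assumed call)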
## New(N=10): 139.41 +/- 5.34
Summary: the sample new (N = 10) has a mean of 139.41 and a standard deviation of about 5.34; its mean is larger than that of old, as expected.
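The t-value itself can then be printed with the user function; an assumed call matching the output below:
cat("t-value=", t.val(old,new)) # t-value comparing old and new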
## t-value= -2.740403
As such, the value \(t=-2.74\) does not tell us much by itself; it becomes informative once the distribution of t is known under the assumption that both samples come from the same distribution. This distribution is easily obtained by Monte Carlo simulation, as shown with the following R code. The code generates 100000 t-values by sampling two groups of size \(N_{old,new}=10\) from a standard normal distribution, \(N(\mu=0,\sigma^2=1)\), and shows the resulting distribution as a histogram, figure 3.10. It then counts the number of t-values exceeding or falling short of \(\pm 2.74\) (that is, \(N(|t|>2.74)\)) and reports this fraction as \(\alpha=0.01312\).
t.val <- function(pop1,pop2){ # t-value for two independent samples (as above)
  x1 <- mean(pop1,na.rm=T) # sample mean of group 1
  x2 <- mean(pop2,na.rm=T) # sample mean of group 2
  n1 <- length(pop1[!is.na(pop1)]) # sample size of group 1
  n2 <- length(pop2[!is.na(pop2)]) # sample size of group 2
  s2.1 <- var(pop1,na.rm=T) # sample variance of group 1
  s2.2 <- var(pop2,na.rm=T) # sample variance of group 2
  s.pool <- sqrt( ( (n1-1)*s2.1 + (n2-1)*s2.2 ) / (n1+n2-2)) # pooled standard deviation
  t <- (x1-x2)/(s.pool*sqrt( 1/n1 + 1/n2 ) ) # t-value, equation (3.4)
  return(t)
}
set.seed(1234) # set the random seed for reproducible results
t <- NULL # vector collecting the simulated t-values
for (i in 1:100000) { # 100000 Monte Carlo repetitions
  old <- rnorm(10,0,1) # group 1: n=10 draws from N(mean=0, sd=1)
  new <- rnorm(10,0,1) # group 2: n=10 draws from N(mean=0, sd=1)
  t <- c(t,t.val(old,new)) # store the t-value of this repetition
}
hist(t, breaks=60,col="lightblue",main="")
abline(v=c(-2.740403,0,2.740403),
lty=c(2,1,2),col=c("red","black","red"))
box()
Figure 3.10: Monte Carlo t-distribution for sample size \(N=10\), obtained by 100000 resampling steps from \(N(0,1)\); the test value \(\pm 2.74\) of the present example is marked red in the plot. The resulting distribution is bell-shaped and close to a normal distribution.
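The fraction of simulated t-values that are at least as extreme as the observed one can then be tabulated, e.g. (a minimal sketch, assuming the vector t from the loop above):
table(abs(t) > 2.740403)/length(t) # fraction of |t| values beyond the observed t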
##
## FALSE TRUE
## 0.98688 0.01312
Of course, ready-made functions for performing t-tests exist in all statistical software packages; in R, e.g., the base function t.test():
set.seed(1234) # set the random seed for reproducible results
old <- rnorm(10,135,5) # "old" sample: n=10 draws from N(mean=135, sd=5)
new <- rnorm(10,140,5) # "new" sample: n=10 draws from N(mean=140, sd=5)
t.test(old,new, var.equal=T ) # two-sample t-test of old vs. new, assuming equal variances
##
## Two Sample t-test
##
## data: old and new
## t = -2.7404, df = 18, p-value = 0.01344
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.173926 -1.475941
## sample estimates:
## mean of x mean of y
## 133.0842 139.4091
The t-test reports an \(\alpha\)-value of 0.01344, in good agreement with the result obtained from the Monte Carlo simulation. In addition, the parameter df in the output reports the degrees of freedom of the test, \(df=18\): the t-test estimates two "internal" parameters, \(\bar{X}_1\) and \(\bar{X}_2\), and these two parameters must be subtracted from the total sample size \(N=20\), hence \(df=18\). By convention, a test result is called significant in statistics when \(0.01\le\alpha\le 0.05\) and highly significant when \(\alpha<0.01\). When a test result is judged significant given \(\alpha\), \(H_0\) is rejected in favour of \(H_1\), while silently accepting that in \(100\cdot\alpha\,\%\) of such cases \(H_0\) will nevertheless be true.
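For reference, the reported p-value can also be reproduced directly from the t distribution with \(df=18\); a minimal sketch using base R's pt():
2*pt(-2.7404, df=18) # two-sided tail probability, approximately 0.01344
A related question is how many observations per group are needed to detect a given effect reliably; the call to pwr::pwr.t.test() below performs such a power analysis for an effect size of one standard deviation (d = 1), a significance level of 0.05 and a desired power of 0.9.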
pwr::pwr.t.test( # power analysis for a two-sample t-test
  n = NULL, # observations in _each_ group (to be determined)
  d = 1, # effect size (Cohen's d)
  sig.level = 0.05, # type I error probability
  power = 0.9, # desired power (1 - type II error probability)
  alternative = c("greater")) # one-sided alternative hypothesis
##
## Two-sample t test power calculation
##
## n = 17.84712
## d = 1
## sig.level = 0.05
## power = 0.9
## alternative = greater
##
## NOTE: n is number in *each* group
The power analysis yields \(n=17.85\) observations per group (rounded up to \(n=18\) in practice) for detecting an effect size of \(d=1\) with a significance level of 0.05, a power of 0.9, and a one-sided ("greater") alternative hypothesis.
Least-squares problems can be solved conveniently with statistical software, making Ordinary Least Squares (OLS) the most widely used statistical method in science. In R the lm() function provides all functionality for solving OLS problems, e.g. with the R example code
###### plot SS
set.seed(1234) # set the random seed for reproducible results
x <- seq(0,5,length=30) # predictor: 30 equally spaced values between 0 and 5
y <- 3 + 5*x + rnorm(length(x),0,1) # response: true intercept 3, slope 5, noise sd 1
x.df <- data.frame(y,x) # make a data.frame
res <- lm(y~x,data=x.df) # fit the linear model by OLS
summary(res) # show results
##
## Call:
## lm(formula = y ~ x, data = x.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0306 -0.5751 -0.2109 0.5522 2.7050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6800 0.3273 8.188 6.51e-09 ***
## x 5.0094 0.1124 44.562 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9189 on 28 degrees of freedom
## Multiple R-squared: 0.9861, Adjusted R-squared: 0.9856
## F-statistic: 1986 on 1 and 28 DF, p-value: < 2.2e-16
The summary reports a residual standard error of 0.9189 on 28 degrees of freedom and a multiple R-squared of 0.9861; the p-value of the slope (< 2.2e-16) is far below any customary significance level, so the null hypothesis of a zero slope is clearly rejected.
a1 <- cov(x,y)/var(x) # analytical OLS solution
a0 <- mean(y) - a1*mean(x)
c(a0=a0,a1=a1) # show OLS estimators a0, a1
## a0 a1
## 2.680009 5.009426
The analytical estimates, \(a_0=2.68\) and \(a_1=5.009\), agree with the coefficients reported by lm() above.
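The correlation matrix printed next can be obtained from the data frame, e.g. (a sketch, assuming x.df from above):
cor(x.df) # correlation matrix of y and x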
## y x
## y 1.0000000 0.9930235
## x 0.9930235 1.0000000
This shows the correlation between the x and y variables (r = 0.993).
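The residual standard error reported next can be extracted from the fitted object, e.g.:
summary(res)$sigma # residual standard error of the fit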
## [1] 0.9188509
This is the residual standard error of the fit res, approximately 0.9189.
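The following value is the squared correlation between the observed and the fitted response, e.g.:
cor(y, predict(res))^2 # squared correlation of y and y-hat, equal to R-squared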
## [1] 0.9860958
This is the squared correlation between y and the fitted values \(\hat{y}\), which equals the multiple R-squared reported in the summary.
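The ANOVA table and its column names shown next could be produced with, e.g.:
anova(res) # analysis-of-variance table of the fit
names(anova(res)) # column names of the ANOVA table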
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 1676.56 1676.56 1985.8 < 2.2e-16 ***
## Residuals 28 23.64 0.84
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Df" "Sum Sq" "Mean Sq" "F value" "Pr(>F)"
ANOVA table of the fit res, decomposing the total sum of squares into the part explained by x and the residual part.
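The next value reproduces R-squared from the ANOVA sums of squares; a sketch using a hypothetical helper object aov.tab:
aov.tab <- anova(res) # store the ANOVA table (hypothetical helper)
aov.tab[["Sum Sq"]][1]/sum(aov.tab[["Sum Sq"]]) # SS(x)/SS(total) = R-squared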
## [1] 0.9860958
The R-squared of 0.986 (about 98.6 %) confirms that the independent variable explains almost all of the variation in the dependent variable.
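The F-value can likewise be recomputed from the mean squares, e.g. (reusing the hypothetical aov.tab from above):
aov.tab[["Mean Sq"]][1]/aov.tab[["Mean Sq"]][2] # F = MS(x)/MS(residuals)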
## [1] 1985.773
The F-value recomputed from the mean squares (1985.773) agrees with the value of 1985.8 reported in the ANOVA table, apart from rounding.
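The corresponding tail probability of the F distribution can be evaluated with pf(), e.g.:
1 - pf(1985.773, 1, 28) # upper-tail probability of F(1,28); zero within machine precision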
## [1] 0
The probability of obtaining an F-value this large under \(H_0\) is essentially zero, so \(H_0\) is rejected.
The R code for estimating and testing the contrast (new - old) with OLS then becomes
rm(list=ls()) # delete all objects from the workspace
set.seed(1234) # set the random seed for reproducible results
old <- rnorm(10,135,5) # "old" sample: n=10 draws from N(mean=135, sd=5)
new <- rnorm(10,140,5) # "new" sample: n=10 draws from N(mean=140, sd=5)
x <- data.frame(obsnr=1:20,treat=c(rep("old",10),
rep("new",10) ), y=round(c(old,new),2))
levels(x$treat) # wrong order with "new" being the reference
## NULL
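As the Call line of the output below shows, the fit was obtained with lm(y ~ treat, data = x); its summary then reads:
summary(lm(y ~ treat, data = x)) # fit the one-factor model and summarize it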
##
## Call:
## lm(formula = y ~ treat, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.815 -3.365 -0.930 3.495 12.672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 139.408 1.632 85.404 <2e-16 ***
## treatold -6.323 2.308 -2.739 0.0135 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.162 on 18 degrees of freedom
## Multiple R-squared: 0.2942, Adjusted R-squared: 0.255
## F-statistic: 7.502 on 1 and 18 DF, p-value: 0.01348
The linear-model summary above reports a residual standard error of 5.162 on 18 degrees of freedom and a multiple R-squared of 0.2942 (about 29 %). The coefficient treatold = -6.323 estimates the contrast (old - new), and its p-value of 0.0135 agrees with the t-test result above.
Outliers are a reality in empirical data and might reflect a lack of experimental control or, alternatively, model inadequacy, i.e. a non-linearity not appropriately described by the empirical model. Depending on the objective of the project, outliers should be checked by repeating the outlying experiments.
set.seed(1234) # set the random seed for reproducible results
x <- seq(0,5,length=30) # predictor: 30 equally spaced values between 0 and 5
XX <- as.matrix(cbind(1,x)) # design matrix with intercept column
y <- 3 + 5*x + rnorm(length(x),0,5) # response: true intercept 3, slope 5, noise sd 5
y[x > 3.8 & x < 4.2] <- y[x > 3.8 & x < 4.2] - 25 # shift observations near x = 4 down by 25 (simulated outliers)
res <- lm(y~x) # fit the linear model by OLS
layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))
plot(x,y,type="n",pch=16, ylim=c(-20,40), main="y~x")
text(x,y,label=1:length(y))
abline(lm(y~x)$coef)
grid()
plot(predict(res), rstudent(res),xlab=expression( paste("predicted response ", hat(y)) ) ,ylim=c(-5,3) ,
ylab="studentized residuals",type="n",
main=expression(paste("(",frac(paste("y - ", hat(y)),sigma), ") ~ ", hat(y) )) )
text(predict(res), rstudent(res),label=1:length(y))
abline(h=c(-2,0,2),lty=c(2,1,2))
grid()
plot(1:length(y), rstudent(res),xlab="run order",ylab="studentized residuals",type="l",
main=expression(paste("(",frac(paste("y - ", hat(y)),sigma), ") ~ run order" ))) # studentized residuals vs. run order
abline(h=c(-2,0,2),lty=c(2,1,2)) # reference lines at 0 and +/-2
grid()
acf(rstudent(res), main="Autocorrelation function of residuals") # autocorrelation function of the studentized residuals
Figure: Effects of outlying observations on OLS estimation.
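The corresponding model summary, shown next, could be printed with, e.g.:
summary(res) # summary of the OLS fit including the outlying observations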
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.4949 -1.3215 0.6568 3.2538 16.0924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6366 2.6802 0.984 0.333666
## x 3.8858 0.9205 4.221 0.000232 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.524 on 28 degrees of freedom
## Multiple R-squared: 0.3889, Adjusted R-squared: 0.3671
## F-statistic: 17.82 on 1 and 28 DF, p-value: 0.0002315
The summary of this fit shows a residual standard error of 7.524 on 28 degrees of freedom and a multiple R-squared of only 0.3889 (about 39 %): the outlying observations inflate the residual error, bias the slope estimate down to 3.89 (true value 5), and substantially weaken the apparent fit. The slope nevertheless remains significant (p = 0.0002), so \(H_0\) is still rejected.
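As a quick programmatic check, the outlying runs flagged in the residual plots can also be listed directly; a minimal sketch using the studentized residuals from above:
which(abs(rstudent(res)) > 2) # indices of observations with |studentized residual| > 2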