Evaluating equation (3.4) with \(N_{1,2}\), \(\bar{x}_{1,2}\) and \(s_{1,2}\) gives a t-value of \(t_{1,2}=-2.740403\). The following R snippet shows how the two samples were generated and how the t-value was calculated with the user-defined function t.val() (note that the two samples truly differ by \(\Delta\mu=5\) with "true" uncertainty \(\sigma=5\)).
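For reference, the statistic implemented in t.val() below is the pooled two-sample t-value, cf. equation (3.4), in the notation used here:
\[
t_{1,2}=\frac{\bar{x}_1-\bar{x}_2}{s_{pool}\sqrt{\frac{1}{N_1}+\frac{1}{N_2}}},\qquad
s_{pool}=\sqrt{\frac{(N_1-1)s_1^2+(N_2-1)s_2^2}{N_1+N_2-2}}.
\]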
rm(list=ls())                      # delete all objects in the workspace
set.seed(1234)                     # set a seed so that the results are reproducible
t.val <- function(pop1,pop2){      # pooled two-sample t-value, cf. equation (3.4)
  x1 <- mean(pop1,na.rm=T)         # sample mean of group 1
  x2 <- mean(pop2,na.rm=T)         # sample mean of group 2
  n1 <- length(pop1[!is.na(pop1)]) # sample size of group 1
  n2 <- length(pop2[!is.na(pop2)]) # sample size of group 2
  s2.1 <- var(pop1,na.rm=T)        # sample variance of group 1
  s2.2 <- var(pop2,na.rm=T)        # sample variance of group 2
  s.pool <- sqrt( ( (n1-1)*s2.1 + (n2-1)*s2.2 ) / (n1+n2-2)) # pooled standard deviation
  t <- (x1-x2)/(s.pool*sqrt( 1/n1 + 1/n2 ) )                 # t-value
  return(t)
}
old <- rnorm(10,135,5) # 10 draws from a normal distribution with mean = 135 and sd = 5
new <- rnorm(10,140,5) # 10 draws from a normal distribution with mean = 140 and sd = 5
cat("Old(N=10):",round( mean(old),2),"+/-", round(sd(old),2)) # report mean and standard deviation of old
cat("\nNew(N=10):",round( mean(new),2),"+/-", round(sd(new),2)) # report mean and standard deviation of new
cat("\nt-value=", t.val(old,new)) # t-value of the two samples
## Old(N=10): 133.08 +/- 4.98
## New(N=10): 139.41 +/- 5.34
## t-value= -2.740403
As such, the value of \(t=-2.74\) does not tell much, but it would tell more if the distribution of t were known under the assumption that both samples come from the same distribution. This distribution is easily obtained by Monte Carlo simulation, as shown with the following R code. The code generates 100000 t-values by sampling from a normal distribution, \(N(\mu=0,\sigma^2=1)\), with sample sizes \(N_{old,new}=10\), and shows the resulting distribution as a histogram, figure 3.10. It then counts the number of t-values exceeding or falling short of \(\pm 2.74\) (that is, \(N_{|t|>2.74}\)) and reports this as a relative frequency, \(\alpha=0.01312\).
t.val <- function(pop1,pop2){      # pooled two-sample t-value, as defined above
  x1 <- mean(pop1,na.rm=T)         # sample mean of group 1
  x2 <- mean(pop2,na.rm=T)         # sample mean of group 2
  n1 <- length(pop1[!is.na(pop1)]) # sample size of group 1
  n2 <- length(pop2[!is.na(pop2)]) # sample size of group 2
  s2.1 <- var(pop1,na.rm=T)        # sample variance of group 1
  s2.2 <- var(pop2,na.rm=T)        # sample variance of group 2
  s.pool <- sqrt( ( (n1-1)*s2.1 + (n2-1)*s2.2 ) / (n1+n2-2)) # pooled standard deviation
  t <- (x1-x2)/(s.pool*sqrt( 1/n1 + 1/n2 ) )                 # t-value
  return(t)
}
set.seed(1234)                   # set a seed so that the results are reproducible
t <- NULL                        # vector collecting the simulated t-values
for (i in 1:100000) {            # 100000 Monte Carlo resampling steps
  old <- rnorm(10,0,1)           # 10 draws from N(0,1)
  new <- rnorm(10,0,1)           # 10 draws from N(0,1)
  t <- c(t,t.val(old,new))       # t-value of the two resampled groups
}
hist(t, breaks=60,col="lightblue",main="")
abline(v=c(-2.740403,0,2.740403),
lty=c(2,1,2),col=c("red","black","red"))
box()
Figure 3.10: Monte Carlo t-distribution for sample size \(N=10\), obtained by 100000 resampling steps from \(N(0,1)\); the test value \(\pm 2.74\) of the present example is marked red in the plot.
table( abs(t) > 2.740403 )/length(t) # fraction of simulated |t|-values exceeding 2.74
##
## FALSE TRUE
## 0.98688 0.01312
Of course, there are ready-made functions for performing t-tests in all statistical software packages; in R, e.g., the base function t.test():
set.seed(1234)         # set a seed so that the results are reproducible
old <- rnorm(10,135,5) # 10 draws from a normal distribution with mean = 135 and sd = 5
new <- rnorm(10,140,5) # 10 draws from a normal distribution with mean = 140 and sd = 5
t.test(old,new, var.equal=T ) # two-sample t-test assuming equal variances
##
## Two Sample t-test
##
## data: old and new
## t = -2.7404, df = 18, p-value = 0.01344
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.173926 -1.475941
## sample estimates:
## mean of x mean of y
## 133.0842 139.4091
The t-test reports an \(\alpha\)-value\(^{14}\) of \(0.01344\), in good agreement with the result obtained from the Monte Carlo simulation. In addition, the parameter df in the output reports the degrees of freedom of the test, \(df=18\). The t-test is based on two "internal" parameters, \(\bar{X}_1\) and \(\bar{X}_2\), and these two parameters must be deducted from the total sample size \(N=20\), hence \(df=18\).\(^{15}\) By convention, a test result is called significant in statistics when \(0.01\le\alpha\le 0.05\) and highly significant when \(\alpha<0.01\). When a test result is judged significant at a given \(\alpha\), \(H_0\) is rejected in favour of \(H_1\), while silently accepting that in \(100\cdot\alpha\%\) of such cases \(H_0\) will in fact be true.
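The reported value can also be reproduced directly from the t-value and the degrees of freedom with the t-distribution function pt() (a quick check, not part of the original listing):
2*pt(-abs(-2.740403), df=18) # two-sided tail probability for t = -2.74 at df = 18, approx. 0.01344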
Closely related to the choice of \(\alpha\) is the question of how many observations are needed to detect a given effect with sufficient power; in R such a power analysis can be done, e.g., with pwr.t.test() from the pwr package:
pwr::pwr.t.test(              # power analysis for the two-sample t-test
  n = NULL,                   # observations in _each_ group (to be determined)
  d = 1,                      # effect size, here d = delta mu / sigma = 5/5 = 1
  sig.level = 0.05,           # type I error probability
  power = 0.9,                # desired power (1 - type II error probability)
  alternative = c("greater")) # one-sided alternative hypothesis
##
## Two-sample t test power calculation
##
## n = 17.84712
## d = 1
## sig.level = 0.05
## power = 0.9
## alternative = greater
##
## NOTE: n is number in *each* group
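Conversely, pwr.t.test() can report the power actually achieved by the present design with \(N=10\) observations per group and \(d=\Delta\mu/\sigma=1\) when the argument power is left unspecified (a sketch of the complementary call, assuming the same one-sided alternative):
pwr::pwr.t.test(n = 10, d = 1,            # 10 observations in each group, effect size d = 1
                sig.level = 0.05,         # type I error probability
                power = NULL,             # power is now the quantity to be solved for
                alternative = "greater")  # one-sided alternative, as above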
Least Squares problems can be solved conveniently with statistical software, making Ordinary Least Squares (OLS) the most widely used statistical method in science. In R, the lm() function provides all the functionality for solving OLS problems, e.g. with the R example code
###### plot SS
set.seed(1234)           # set a seed so that the results are reproducible
x <- seq(0,5,length=30)  # 30 equally spaced x-values between 0 and 5
y <- 3 + 5*x + rnorm(length(x),0,1) # linear model y = 3 + 5x plus N(0,1) noise
x.df <- data.frame(y,x)  # make a data.frame
res <- lm(y~x,data=x.df) # fit the linear model by OLS
summary(res)             # show results
##
## Call:
## lm(formula = y ~ x, data = x.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0306 -0.5751 -0.2109 0.5522 2.7050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6800 0.3273 8.188 6.51e-09 ***
## x 5.0094 0.1124 44.562 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9189 on 28 degrees of freedom
## Multiple R-squared: 0.9861, Adjusted R-squared: 0.9856
## F-statistic: 1986 on 1 and 28 DF, p-value: < 2.2e-16
a1 <- cov(x,y)/var(x)      # analytical OLS solution for the slope
a0 <- mean(y) - a1*mean(x) # analytical OLS solution for the intercept
c(a0=a0,a1=a1)             # show OLS estimators a0, a1
## a0 a1
## 2.680009 5.009426
cor(x.df)                 # correlation matrix of y and x
## y x
## y 1.0000000 0.9930235
## x 0.9930235 1.0000000
summary(res)$sigma        # residual standard error
## [1] 0.9188509
summary(res)$r.squared    # multiple R-squared
## [1] 0.9860958
names(anova(res))         # components of the ANOVA table of the fit
## [1] "Df" "Sum Sq" "Mean Sq" "F value" "Pr(>F)"
cor(x,y)^2                # R-squared equals the squared correlation between x and y
## [1] 0.9860958
anova(res)$"F value"[1]   # F-statistic of the regression
## [1] 1985.773
1 - pf(anova(res)$"F value"[1], 1, 28) # p-value of the F-test (numerically zero)
## [1] 0
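The same estimates can also be obtained from the normal equations in matrix form, \(\hat{a}=(X^TX)^{-1}X^Ty\); a minimal sketch, where X is the design matrix consisting of a column of ones (intercept) and the column of x-values:
X <- cbind(1, x)               # design matrix: intercept column and x
solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations (X'X) a = X'y for a0 and a1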
The R-code for estimating and testing the contrast (new-old) with OLS then becomes
rm(list=ls())          # delete all objects in the workspace
set.seed(1234)         # set a seed so that the results are reproducible
old <- rnorm(10,135,5) # 10 draws from a normal distribution with mean = 135 and sd = 5
new <- rnorm(10,140,5) # 10 draws from a normal distribution with mean = 140 and sd = 5
x <- data.frame(obsnr=1:20, treat=c(rep("old",10), rep("new",10)),
                y=round(c(old,new),2)) # run number, treatment and response
levels(x$treat) # wrong order, with "new" becoming the reference level in lm(); treat is still a character vector, hence NULL
## NULL
summary( lm(y ~ treat, data=x) ) # OLS estimation of the treatment contrast
##
## Call:
## lm(formula = y ~ treat, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.815 -3.365 -0.930 3.495 12.672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 139.408 1.632 85.404 <2e-16 ***
## treatold -6.323 2.308 -2.739 0.0135 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.162 on 18 degrees of freedom
## Multiple R-squared: 0.2942, Adjusted R-squared: 0.255
## F-statistic: 7.502 on 1 and 18 DF, p-value: 0.01348
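With "new" as the reference level, the intercept is the mean of the new sample and the treatold coefficient is the estimated contrast \(\bar{x}_{old}-\bar{x}_{new}\); this can be verified directly from the data (a quick check, not part of the original listing):
with(x, mean(y[treat=="new"]))                         # about 139.41, the intercept
with(x, mean(y[treat=="old"]) - mean(y[treat=="new"])) # about -6.32, the treatold coefficient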
Outliers are a reality in empirical data and might reflect a lack of experimental control or, alternatively, model inadequacy, i.e. a non-linearity not appropriately described by the empirical model. Depending on the objective of the project\(^{20}\), outliers should be checked by repeating the outlying experiments.
set.seed(1234)              # set a seed so that the results are reproducible
x <- seq(0,5,length=30)     # 30 equally spaced x-values between 0 and 5
XX <- as.matrix(cbind(1,x)) # design matrix with intercept column
y <- 3 + 5*x + rnorm(length(x),0,5) # linear model y = 3 + 5x plus N(0,25) noise
y[x > 3.8 & x < 4.2] <- y[x > 3.8 & x < 4.2] - 25 # shift observations near x = 4 downwards (outliers)
res <- lm(y~x)              # OLS fit of the contaminated data
layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))
plot(x,y,type="n",pch=16, ylim=c(-20,40), main="y~x")
text(x,y,label=1:length(y))
abline(lm(y~x)$coef)
grid()
plot(predict(res), rstudent(res),xlab=expression( paste("predicted response ", hat(y)) ) ,ylim=c(-5,3) ,
ylab="studentized residuals",type="n",
main=expression(paste("(",frac(paste("y - ", hat(y)),sigma), ") ~ ", hat(y) )) )
text(predict(res), rstudent(res),label=1:length(y))
abline(h=c(-2,0,2),lty=c(2,1,2))
grid()
plot(1:length(y), rstudent(res),xlab="run order",ylab="studentized residuals",type="l",
main=expression(paste("(",frac(paste("y - ", hat(y)),sigma), ") ~ run order" ))) # residuals vs run order
abline(h=c(-2,0,2),lty=c(2,1,2)) # reference lines at 0 and +/- 2
grid()
acf(rstudent(res), main="Autocorrelation function of residuals") # autocorrelation of the studentized residuals
Figure: Effects of outlying observations on OLS estimation.
summary(res) # numerical summary of the fit including the outliers
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.4949 -1.3215 0.6568 3.2538 16.0924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6366 2.6802 0.984 0.333666
## x 3.8858 0.9205 4.221 0.000232 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.524 on 28 degrees of freedom
## Multiple R-squared: 0.3889, Adjusted R-squared: 0.3671
## F-statistic: 17.82 on 1 and 28 DF, p-value: 0.0002315
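Observations suspected to be outliers can be flagged directly from the studentized residuals used in the diagnostic plots above, e.g. with the \(\pm 2\) threshold marked there (a minimal sketch, not part of the original listing):
which( abs(rstudent(res)) > 2 ) # indices of observations with |studentized residual| > 2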