Evaluating equation (3.4) with \(N_{1,2}\), \(\bar{x}_{1,2}\) and \(s_{1,2}\) gives a t-value of \(t_{1,2}=-2.740403\). The following R snippet shows how the two samples were generated and how the t-value was calculated with the user-defined function t.val() (note that the two samples truly differ by \(\Delta\mu=5\) with "true" uncertainty \(\sigma=5\)).
rm(list=ls()) # delete all objects from the workspace
set.seed(1234) # set the random seed for reproducible results
t.val <- function(pop1,pop2){ # t-value for two independent samples
  x1 <- mean(pop1,na.rm=T) # sample mean of group 1
  x2 <- mean(pop2,na.rm=T) # sample mean of group 2
  n1 <- length(pop1[!is.na(pop1)]) # sample size of group 1
  n2 <- length(pop2[!is.na(pop2)]) # sample size of group 2
  s2.1 <- var(pop1,na.rm=T) # sample variance of group 1
  s2.2 <- var(pop2,na.rm=T) # sample variance of group 2
  s.pool <- sqrt( ( (n1-1)*s2.1 + (n2-1)*s2.2 ) / (n1+n2-2)) # pooled standard deviation
  t <- (x1-x2)/(s.pool*sqrt( 1/n1 + 1/n2 ) ) # t-value, equation (3.4)
  return(t)
}
old <- rnorm(10,135,5) # "old" sample: n=10 draws from N(mean=135, sd=5)
new <- rnorm(10,140,5) # "new" sample: n=10 draws from N(mean=140, sd=5)
cat("Old(N=10):",round( mean(old),2),"+/-", round(sd(old),2)) #producing output old, with mean and standard deviation are decimal number.## Old(N=10): 133.08 +/- 4.98
Summary: the sample old (N = 10) has a mean of 133.08 and a standard deviation of about 4.98.
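The analogous call for the new sample, presumably what produced the output below, would be:
cat("New(N=10):",round(mean(new),2),"+/-", round(sd(new),2)) # print mean and sd of "new" (assumed call)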
## New(N=10): 139.41 +/- 5.34
Summary: the sample new (N = 10) has a mean of 139.41 and a standard deviation of about 5.34; its mean is larger than that of old, as expected.
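The t-value itself can then be printed with the user function; an assumed call matching the output below:
cat("t-value=", t.val(old,new)) # t-value comparing old and new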
## t-value= -2.740403
As such, the value \(t=-2.74\) does not tell us much by itself; it becomes informative once the distribution of t is known under the assumption that both samples come from the same distribution. This distribution is easily obtained by Monte Carlo simulation, as shown with the following R code. The code generates 100000 t-values by sampling two groups of size \(N_{old,new}=10\) from a standard normal distribution, \(N(\mu=0,\sigma^2=1)\), and shows the resulting distribution as a histogram, figure 3.10. It then counts the number of t-values exceeding or falling short of \(\pm 2.74\) (that is, \(N(|t|>2.74)\)) and reports this fraction as \(\alpha=0.01312\).
t.val <- function(pop1,pop2){ # t-value for two independent samples (as above)
  x1 <- mean(pop1,na.rm=T) # sample mean of group 1
  x2 <- mean(pop2,na.rm=T) # sample mean of group 2
  n1 <- length(pop1[!is.na(pop1)]) # sample size of group 1
  n2 <- length(pop2[!is.na(pop2)]) # sample size of group 2
  s2.1 <- var(pop1,na.rm=T) # sample variance of group 1
  s2.2 <- var(pop2,na.rm=T) # sample variance of group 2
  s.pool <- sqrt( ( (n1-1)*s2.1 + (n2-1)*s2.2 ) / (n1+n2-2)) # pooled standard deviation
  t <- (x1-x2)/(s.pool*sqrt( 1/n1 + 1/n2 ) ) # t-value, equation (3.4)
  return(t)
}
set.seed(1234) # set the random seed for reproducible results
t <- NULL # vector collecting the simulated t-values
for (i in 1:100000) { # 100000 Monte Carlo repetitions
  old <- rnorm(10,0,1) # group 1: n=10 draws from N(mean=0, sd=1)
  new <- rnorm(10,0,1) # group 2: n=10 draws from N(mean=0, sd=1)
  t <- c(t,t.val(old,new)) # store the t-value of this repetition
}
hist(t, breaks=60,col="lightblue",main="")
abline(v=c(-2.740403,0,2.740403),
lty=c(2,1,2),col=c("red","black","red"))
box()
Figure 3.10: Monte Carlo t-distribution for sample size \(N=10\), obtained by 100000 resampling steps from \(N(0,1)\); the test value \(\pm 2.74\) of the present example is marked red in the plot. The resulting distribution is bell-shaped and close to a normal distribution.
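The fraction of simulated t-values that are at least as extreme as the observed one can then be tabulated, e.g. (a minimal sketch, assuming the vector t from the loop above):
table(abs(t) > 2.740403)/length(t) # fraction of |t| values beyond the observed t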
##
## FALSE TRUE
## 0.98688 0.01312
Of course, ready-made functions for performing t-tests exist in all statistical software packages; in R, e.g., the base function t.test():
set.seed(1234) # set the random seed for reproducible results
old <- rnorm(10,135,5) # "old" sample: n=10 draws from N(mean=135, sd=5)
new <- rnorm(10,140,5) # "new" sample: n=10 draws from N(mean=140, sd=5)
t.test(old,new, var.equal=T ) # two-sample t-test of old vs. new, assuming equal variances
##
## Two Sample t-test
##
## data: old and new
## t = -2.7404, df = 18, p-value = 0.01344
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.173926 -1.475941
## sample estimates:
## mean of x mean of y
## 133.0842 139.4091
The t-test reports an \(\alpha\)-value of 0.01344, in good agreement with the result obtained from the Monte Carlo simulation. In addition, the parameter df in the output reports the degrees of freedom of the test, \(df=18\): the t-test estimates two "internal" parameters, \(\bar{X}_1\) and \(\bar{X}_2\), and these two parameters must be subtracted from the total sample size \(N=20\), hence \(df=18\). By convention, a test result is called significant in statistics when \(0.01\le\alpha\le 0.05\) and highly significant when \(\alpha<0.01\). When a test result is judged significant given \(\alpha\), \(H_0\) is rejected in favour of \(H_1\), while silently accepting that in \(100\cdot\alpha\,\%\) of such cases \(H_0\) will nevertheless be true.
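For reference, the reported p-value can also be reproduced directly from the t distribution with \(df=18\); a minimal sketch using base R's pt():
2*pt(-2.7404, df=18) # two-sided tail probability, approximately 0.01344
A related question is how many observations per group are needed to detect a given effect reliably; the call to pwr::pwr.t.test() below performs such a power analysis for an effect size of one standard deviation (d = 1), a significance level of 0.05 and a desired power of 0.9.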
pwr::pwr.t.test( # power analysis for a two-sample t-test
  n = NULL, # observations in _each_ group (to be determined)
  d = 1, # effect size (Cohen's d)
  sig.level = 0.05, # type I error probability
  power = 0.9, # desired power (1 - type II error probability)
  alternative = c("greater")) # one-sided alternative hypothesis
##
## Two-sample t test power calculation
##
## n = 17.84712
## d = 1
## sig.level = 0.05
## power = 0.9
## alternative = greater
##
## NOTE: n is number in *each* group
The power analysis yields \(n=17.85\) observations per group (rounded up to \(n=18\) in practice) for detecting an effect size of \(d=1\) with a significance level of 0.05, a power of 0.9, and a one-sided ("greater") alternative hypothesis.
Least-squares problems can be solved conveniently with statistical software, making Ordinary Least Squares (OLS) the most widely used statistical method in science. In R the lm() function provides all functionality for solving OLS problems, e.g. with the R example code
###### plot SS
set.seed(1234) # set the random seed for reproducible results
x <- seq(0,5,length=30) # predictor: 30 equally spaced values between 0 and 5
y <- 3 + 5*x + rnorm(length(x),0,1) # response: true intercept 3, slope 5, noise sd 1
x.df <- data.frame(y,x) # make a data.frame
res <- lm(y~x,data=x.df) # fit the linear model by OLS
summary(res) # show results
##
## Call:
## lm(formula = y ~ x, data = x.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0306 -0.5751 -0.2109 0.5522 2.7050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6800 0.3273 8.188 6.51e-09 ***
## x 5.0094 0.1124 44.562 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9189 on 28 degrees of freedom
## Multiple R-squared: 0.9861, Adjusted R-squared: 0.9856
## F-statistic: 1986 on 1 and 28 DF, p-value: < 2.2e-16
The summary reports a residual standard error of 0.9189 on 28 degrees of freedom and a multiple R-squared of 0.9861; the p-value of the slope (< 2.2e-16) is far below any customary significance level, so the null hypothesis of a zero slope is clearly rejected.
a1 <- cov(x,y)/var(x) # analytical OLS solution
a0 <- mean(y) - a1*mean(x)
c(a0=a0,a1=a1) # show OLS estimators a0, a1
## a0 a1
## 2.680009 5.009426
The analytical estimates, \(a_0=2.68\) and \(a_1=5.009\), agree with the coefficients reported by lm() above.
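The correlation matrix printed next can be obtained from the data frame, e.g. (a sketch, assuming x.df from above):
cor(x.df) # correlation matrix of y and x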
## y x
## y 1.0000000 0.9930235
## x 0.9930235 1.0000000
This shows the correlation between the x and y variables (r = 0.993).
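The residual standard error reported next can be extracted from the fitted object, e.g.:
summary(res)$sigma # residual standard error of the fit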
## [1] 0.9188509
This is the residual standard error of the fit res, approximately 0.9189.
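The following value is the squared correlation between the observed and the fitted response, e.g.:
cor(y, predict(res))^2 # squared correlation of y and y-hat, equal to R-squared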
## [1] 0.9860958
This is the squared correlation between y and the fitted values \(\hat{y}\), which equals the multiple R-squared reported in the summary.
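The ANOVA table and its column names shown next could be produced with, e.g.:
anova(res) # analysis-of-variance table of the fit
names(anova(res)) # column names of the ANOVA table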
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 1676.56 1676.56 1985.8 < 2.2e-16 ***
## Residuals 28 23.64 0.84
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Df" "Sum Sq" "Mean Sq" "F value" "Pr(>F)"
ANOVA table of the fit res, decomposing the total sum of squares into the part explained by x and the residual part.
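The next value reproduces R-squared from the ANOVA sums of squares; a sketch using a hypothetical helper object aov.tab:
aov.tab <- anova(res) # store the ANOVA table (hypothetical helper)
aov.tab[["Sum Sq"]][1]/sum(aov.tab[["Sum Sq"]]) # SS(x)/SS(total) = R-squared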
## [1] 0.9860958
The R-squared of 0.986 (about 98.6 %) confirms that the independent variable explains almost all of the variation in the dependent variable.
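The F-value can likewise be recomputed from the mean squares, e.g. (reusing the hypothetical aov.tab from above):
aov.tab[["Mean Sq"]][1]/aov.tab[["Mean Sq"]][2] # F = MS(x)/MS(residuals)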
## [1] 1985.773
The F-value recomputed from the mean squares (1985.773) agrees with the value of 1985.8 reported in the ANOVA table, apart from rounding.
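The corresponding tail probability of the F distribution can be evaluated with pf(), e.g.:
1 - pf(1985.773, 1, 28) # upper-tail probability of F(1,28); zero within machine precision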
## [1] 0
The probability of obtaining an F-value this large under \(H_0\) is essentially zero, so \(H_0\) is rejected.
The R code for estimating and testing the contrast (new - old) with OLS then becomes
rm(list=ls()) # delete all objects from the workspace
set.seed(1234) # set the random seed for reproducible results
old <- rnorm(10,135,5) # "old" sample: n=10 draws from N(mean=135, sd=5)
new <- rnorm(10,140,5) # "new" sample: n=10 draws from N(mean=140, sd=5)
x <- data.frame(obsnr=1:20,treat=c(rep("old",10),
rep("new",10) ), y=round(c(old,new),2))
levels(x$treat) # wrong order with "new" being the reference
## NULL
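As the Call line of the output below shows, the fit was obtained with lm(y ~ treat, data = x); its summary then reads:
summary(lm(y ~ treat, data = x)) # fit the one-factor model and summarize it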
##
## Call:
## lm(formula = y ~ treat, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.815 -3.365 -0.930 3.495 12.672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 139.408 1.632 85.404 <2e-16 ***
## treatold -6.323 2.308 -2.739 0.0135 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.162 on 18 degrees of freedom
## Multiple R-squared: 0.2942, Adjusted R-squared: 0.255
## F-statistic: 7.502 on 1 and 18 DF, p-value: 0.01348
The linear-model summary above reports a residual standard error of 5.162 on 18 degrees of freedom and a multiple R-squared of 0.2942 (about 29 %). The coefficient treatold = -6.323 estimates the contrast (old - new), and its p-value of 0.0135 agrees with the t-test result above.
Outliers are a reality in empirical data and might reflect a lack of experimental control or, alternatively, model inadequacy, i.e. a non-linearity not appropriately described by the empirical model. Depending on the objective of the project, outliers should be checked by repeating the outlying experiments.
set.seed(1234) # set the random seed for reproducible results
x <- seq(0,5,length=30) # predictor: 30 equally spaced values between 0 and 5
XX <- as.matrix(cbind(1,x)) # design matrix with intercept column
y <- 3 + 5*x + rnorm(length(x),0,5) # response: true intercept 3, slope 5, noise sd 5
y[x > 3.8 & x < 4.2] <- y[x > 3.8 & x < 4.2] - 25 # shift observations near x = 4 down by 25 (simulated outliers)
res <- lm(y~x) # fit the linear model by OLS
layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))
plot(x,y,type="n",pch=16, ylim=c(-20,40), main="y~x")
text(x,y,label=1:length(y))
abline(lm(y~x)$coef)
grid()
plot(predict(res), rstudent(res),xlab=expression( paste("predicted response ", hat(y)) ) ,ylim=c(-5,3) ,
ylab="studentized residuals",type="n",
main=expression(paste("(",frac(paste("y - ", hat(y)),sigma), ") ~ ", hat(y) )) )
text(predict(res), rstudent(res),label=1:length(y))
abline(h=c(-2,0,2),lty=c(2,1,2))
grid()
plot(1:length(y), rstudent(res),xlab="run order",ylab="studentized residuals",type="l",
main=expression(paste("(",frac(paste("y - ", hat(y)),sigma), ") ~ run order" ))) # studentized residuals vs. run order
abline(h=c(-2,0,2),lty=c(2,1,2)) # reference lines at 0 and +/-2
grid()
acf(rstudent(res), main="Autocorrelation function of residuals") # autocorrelation function of the studentized residuals
Figure: Effects of outlying observations on OLS estimation.
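The corresponding model summary, shown next, could be printed with, e.g.:
summary(res) # summary of the OLS fit including the outlying observations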
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.4949 -1.3215 0.6568 3.2538 16.0924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6366 2.6802 0.984 0.333666
## x 3.8858 0.9205 4.221 0.000232 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.524 on 28 degrees of freedom
## Multiple R-squared: 0.3889, Adjusted R-squared: 0.3671
## F-statistic: 17.82 on 1 and 28 DF, p-value: 0.0002315
The summary of this fit shows a residual standard error of 7.524 on 28 degrees of freedom and a multiple R-squared of only 0.3889 (about 39 %): the outlying observations inflate the residual error, bias the slope estimate down to 3.89 (true value 5), and substantially weaken the apparent fit. The slope nevertheless remains significant (p = 0.0002), so \(H_0\) is still rejected.
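As a quick programmatic check, the outlying runs flagged in the residual plots can also be listed directly; a minimal sketch using the studentized residuals from above:
which(abs(rstudent(res)) > 2) # indices of observations with |studentized residual| > 2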