Evaluating equation (3.4) with \(N_{1,2}\), \(\bar{x}_{1,2}\) and \(s_{1,2}\) gives a t-value of \(t_{1,2}=-2.740403\). The following R snippet shows how the two samples were generated and how the t-value was calculated with the user-defined function t.val() (note that both samples truly differ by \(\Delta\mu=5\) with "true" uncertainty \(\sigma=5\)).

rm(list=ls())                                                   # Delete all datasets and variables
set.seed(1234)                                                  # sets a seed.  It is good to use the same seed all the time.
t.val <- function(pop1,pop2){                                   # computes the two-sample t-value, equation (3.4)
  x1    <- mean(pop1,na.rm=T)                                   # sample mean of group 1
  x2    <- mean(pop2,na.rm=T)                                   # sample mean of group 2
  n1    <- length(pop1[!is.na(pop1)])                           # sample size of group 1 (NAs excluded)
  n2    <- length(pop2[!is.na(pop2)])                           # sample size of group 2 (NAs excluded)
  s2.1  <- var(pop1,na.rm=T)                                    # sample variance of group 1
  s2.2  <- var(pop2,na.rm=T)                                    # sample variance of group 2
  s.pool <- sqrt( ( (n1-1)*s2.1 +  (n2-1)*s2.2  ) / (n1+n2-2))  # pooled standard deviation
  t      <- (x1-x2)/(s.pool*sqrt( 1/n1 + 1/n2  ) )              # t-value
  return(t)
}
old    <- rnorm(10,135,5)                                       # old: n=10 draws from a normal distribution with mean = 135 and sd = 5
new    <- rnorm(10,140,5)                                       # new: n=10 draws from a normal distribution with mean = 140 and sd = 5
cat("Old(N=10):",round( mean(old),2),"+/-", round(sd(old),2))   # producing output old, with mean and standard deviation are decimal number.
## Old(N=10): 133.08 +/- 4.98
cat("New(N=10):",round(mean(new),2),"+/-", round(sd(new),2))    # producing output new with mean and standard deviation are decimal number.
## New(N=10): 139.41 +/- 5.34
cat("t-value=",t.val(old,new))                                  # producing output t-value from old and new
## t-value= -2.740403

As such, the value of \(t=-2.74\) does not tell much by itself, but it would tell more if the distribution of t were known under the assumption that both samples come from the same distribution. This distribution can easily be obtained by Monte Carlo simulation, as shown with the following R code. The code generates 100000 t-values by sampling from a normal distribution, \(N(\mu=0,\sigma^2=1)\), with sample sizes \(N_{old,new}=10\) and shows the resulting distribution as a histogram (figure 3.10). It then counts the number of t-values exceeding or falling short of \(\pm 2.74\) (that is, \(N(|t|>2.74)\)) and reports this as a fraction, \(\alpha=0.01312\).

t.val <- function(pop1,pop2){                                             # computes the two-sample t-value, equation (3.4)
  x1    <- mean(pop1,na.rm=T)                                             # sample mean of group 1
  x2    <- mean(pop2,na.rm=T)                                             # sample mean of group 2
  n1    <- length(pop1[!is.na(pop1)])                                     # sample size of group 1 (NAs excluded)
  n2    <- length(pop2[!is.na(pop2)])                                     # sample size of group 2 (NAs excluded)
  s2.1  <- var(pop1,na.rm=T)                                              # sample variance of group 1
  s2.2  <- var(pop2,na.rm=T)                                              # sample variance of group 2
  
  s.pool <- sqrt( ( (n1-1)*s2.1 +  (n2-1)*s2.2  ) / (n1+n2-2))            # pooled standard deviation
  t      <- (x1-x2)/(s.pool*sqrt( 1/n1 + 1/n2  ) )                        # t-value
  return(t)
}

set.seed(1234)                                                            # sets a seed.  It is good to use the same seed all the time.
t <- NULL                                                                 # empty vector to collect the simulated t-values
for (i in 1:100000) {                                                     # Monte Carlo loop: 100000 resampling steps
  old <- rnorm(10,0,1)                                                    # old: n=10 draws from N(mean = 0, sd = 1)
  new   <- rnorm(10,0,1)                                                  # new: n=10 draws from N(mean = 0, sd = 1)
  t   <- c(t,t.val(old,new))                                              # append the t-value of the current resample
}
hist(t, breaks=60,col="lightblue",main="")
abline(v=c(-2.740403,0,2.740403), 
       lty=c(2,1,2),col=c("red","black","red"))
box()                                                                     # draw a box around the histogram

Monte Carlo t-distribution for sample size \(N=10\) obtained by 100000 resampling steps from \(N(0,1)\); the test value \(\pm 2.74\) of the present example is marked red in the plot.

prop.table(table(abs(t)>2.740403))                                        # fraction of simulated |t| values exceeding 2.740403
## 
##   FALSE    TRUE 
## 0.98688 0.01312

Of course, there are ready-made functions for performing t-tests in all statistical software packages; in R, e.g., the base function t.test():

set.seed(1234)                                                            # sets a seed.  It is good to use the same seed all the time.
old    <- rnorm(10,135,5)                                                 # old: n=10 draws from a normal distribution with mean = 135 and sd = 5
new    <- rnorm(10,140,5)                                                 # new: n=10 draws from a normal distribution with mean = 140 and sd = 5
t.test(old,new, var.equal=T )                                             # two-sample t-test of old vs. new, assuming equal variances
## 
##  Two Sample t-test
## 
## data:  old and new
## t = -2.7404, df = 18, p-value = 0.01344
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.173926  -1.475941
## sample estimates:
## mean of x mean of y 
##  133.0842  139.4091

The t-test reports an \(\alpha\)-value of \(0.01344\), in good agreement with the result obtained from Monte Carlo simulation. In addition, the parameter df in the output reports the degrees of freedom of the test, \(df=18\). The t-test is based on two “internal” parameters, \(\bar{X}_1\) and \(\bar{X}_2\), and these two parameters must be subtracted from the total sample size \(N=20\), hence \(df=18\). By convention, a test result is called significant in statistics when \(0.01\le\alpha\le 0.05\) and highly significant when \(\alpha<0.01\). When a test result is judged significant given \(\alpha\), \(H_0\) is rejected in favour of \(H_1\), while silently accepting that in \(100\cdot\alpha\%\) of such cases \(H_0\) is rejected even though it is true.
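
The same \(\alpha\)-value can be obtained directly from the theoretical t-distribution with \(df=18\); a minimal check in base R, which reproduces the p-value of 0.01344 reported by t.test() above:

2*pt(-2.740403, df=18)                                                    # two-sided tail probability for |t| > 2.740403

The sample size needed to detect a given effect size with a prescribed power can in turn be estimated with the pwr package, e.g. with pwr::pwr.t.test():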

pwr::pwr.t.test(                                                         # Power analysis, t-test
                n = NULL,                                                # Observations in _each_ group (solved for)
                d = 1,                                                   # effect size (Cohen's d)
                sig.level = 0.05,                                        # Type I error probability
                power = 0.9, alternative = c("greater"))                 # power and alternative hypothesis
## 
##      Two-sample t test power calculation 
## 
##               n = 17.84712
##               d = 1
##       sig.level = 0.05
##           power = 0.9
##     alternative = greater
## 
## NOTE: n is number in *each* group
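
According to this power analysis, about 18 observations per group (\(n=17.85\), rounded up to the next integer) are required to detect a standardized effect of \(d=1\) with a power of 0.9 at a one-sided significance level of 0.05.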

Least squares problems can be solved conveniently with statistical software, making Ordinary Least Squares (OLS) the most widely used statistical method in science. In R the lm() function provides all functionality needed for solving OLS problems, e.g. with the R example code

###### plot SS 
set.seed(1234)                                                          # sets a seed.  It is good to use the same seed all the time.
x <- seq(0,5,length=30)
y <- 3 + 5*x + rnorm(length(x),0,1)
x.df <- data.frame(y,x)                                                 # make a data.frame
res <- lm(y~x,data=x.df)                                                # do linear modeling
summary(res)                                                            # show results 
## 
## Call:
## lm(formula = y ~ x, data = x.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0306 -0.5751 -0.2109  0.5522  2.7050 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.6800     0.3273   8.188 6.51e-09 ***
## x             5.0094     0.1124  44.562  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9189 on 28 degrees of freedom
## Multiple R-squared:  0.9861, Adjusted R-squared:  0.9856 
## F-statistic:  1986 on 1 and 28 DF,  p-value: < 2.2e-16
a1 <- cov(x,y)/var(x)                                                   # analytical OLS solution 
a0 <- mean(y) - a1*mean(x)
c(a0=a0,a1=a1)                                                          # show OLS estimators a0,a1
##       a0       a1 
## 2.680009 5.009426
cor(x.df)                                                               # shows the Pearson  r.xy
##           y         x
## y 1.0000000 0.9930235
## x 0.9930235 1.0000000
sqrt(sum((y - predict(res))^2)/28)                                      # residual standard error
## [1] 0.9188509
cor(x.df$y,predict(res))^2                                              # R2-value #1: cor(y,y.hat)^2; 
## [1] 0.9860958
ss <- anova(res)                                                        # ANOVA table of the fitted model
names(ss)                          
## [1] "Df"      "Sum Sq"  "Mean Sq" "F value" "Pr(>F)"
r2 <- ss[,2][1]/sum(ss[,2]) 
r2                                                                      # R2-value #2: SS.model/ss.total
## [1] 0.9860958
F.test <- ss[,3][1]/ss[,3][2]                                           # Calculate F-value 
F.test
## [1] 1985.773
1-pf(F.test,ss[1,1],ss[2,1] )                                           # Prob(F>F.test) under H0, with df1 = 1 and df2 = 28
## [1] 0

Since the p-value is numerically zero, \(H_0\) is rejected: the slope \(a_1\) differs significantly from zero, i.e. there is a significant linear relationship between \(x\) and \(y\) in the regression model.
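
Confidence intervals for the OLS estimators can be read off the fitted model directly; a short base-R check (neither interval includes zero, consistent with the p-values above):

confint(res)                                                            # 95% confidence intervals for a0 and a1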

The R-code for estimating and testing the contrast (new-old) with OLS then becomes

rm(list=ls())                                                          # Delete all datasets and variables
set.seed(1234)                                                         # sets a seed.  It is good to use the same seed all the time.
old    <- rnorm(10,135,5)                                              # old: n=10 draws from a normal distribution with mean = 135 and sd = 5
new    <- rnorm(10,140,5)                                              # new: n=10 draws from a normal distribution with mean = 140 and sd = 5

x <- data.frame(obsnr=1:20,treat=c(rep("old",10), 
                      rep("new",10) ), y=round(c(old,new),2))
levels(x$treat)                                                        # treat is a character vector, so it has no levels yet
## NULL
#x$treat <- relevel(factor(x$treat),ref="old")                         # lm() takes the alphabetically first level "new" as reference; relevel to make "old" the reference
#levels(x$treat)
summary(lm(y~treat,data=x))                                            # summary from linear model
## 
## Call:
## lm(formula = y ~ treat, data = x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.815 -3.365 -0.930  3.495 12.672 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  139.408      1.632  85.404   <2e-16 ***
## treatold      -6.323      2.308  -2.739   0.0135 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.162 on 18 degrees of freedom
## Multiple R-squared:  0.2942, Adjusted R-squared:  0.255 
## F-statistic: 7.502 on 1 and 18 DF,  p-value: 0.01348

The null hypothesis here is that the treatment contrast is zero, \(H_0: a_1=0\) (no difference between new and old), against the alternative \(H_1: a_1\ne 0\).

Since the p-value of the contrast treatold (0.0135) is smaller than 0.05, \(H_0\) is rejected: the two treatments differ significantly, in agreement with the two-sample t-test above (the small deviation from the earlier t-value of \(-2.740\) stems from rounding y to two decimals).
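
As a quick consistency check, the coefficients can be reproduced from the group means: up to the rounding of y to two decimals, the intercept is the mean of the reference group "new" and treatold is the difference of the group means. A minimal sketch, assuming old and new are still in the workspace:

mean(new)                                                              # intercept: mean of the reference group "new"
mean(old) - mean(new)                                                  # treatold: contrast old - new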

Outliers are a reality in empirical data and might reflect a lack of experimental control or, alternatively, model inadequacy, i.e. a non-linearity not appropriately described by the empirical model. Depending on the objective of the project, outliers should be checked by repeating the outlying experiments.

set.seed(1234)                                                         # sets a seed.  It is good to use the same seed all the time.
x <- seq(0,5,length=30)                                                # predictor: 30 equally spaced values in [0,5]
XX <- as.matrix(cbind(1,x))                                            # design matrix (intercept column and x)
y <- 3 + 5*x + rnorm(length(x),0,5)                                    # response: linear trend plus N(0, sd = 5) noise
y[x > 3.8 & x < 4.2] <- y[x > 3.8 & x < 4.2] - 25                      # shift observations near x = 4 down by 25 (introduce outliers)
res <- lm(y~x)                                                         # fit the linear model y ~ x

layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))
plot(x,y,type="n",pch=16, ylim=c(-20,40), main="y~x")
text(x,y,label=1:length(y))
abline(lm(y~x)$coef)
grid()
plot(predict(res), rstudent(res),xlab=expression( paste("predicted response ", hat(y)) ) ,ylim=c(-5,3)   ,
     ylab="studentized residuals",type="n",
     main=expression(paste("(",frac(paste("y - ", hat(y)),sigma), ") ~ ", hat(y)   )) )

text(predict(res), rstudent(res),label=1:length(y))
abline(h=c(-2,0,2),lty=c(2,1,2))
grid()
plot(1:length(y), rstudent(res),xlab="run order",ylab="studentized residuals",type="l",
     main=expression(paste("(",frac(paste("y - ", hat(y)),sigma), ") ~ run order"   )))  # studentized residuals vs. run order
abline(h=c(-2,0,2),lty=c(2,1,2))                                                         # reference lines at 0 and +/- 2
grid()
acf(rstudent(res), main="Autocorrelation function of residuals")                         # autocorrelation function of the residuals

Effects of outlying observations on OLS estimation.

summary(res)                                                         # summary from res
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.4949  -1.3215   0.6568   3.2538  16.0924 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.6366     2.6802   0.984 0.333666    
## x             3.8858     0.9205   4.221 0.000232 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.524 on 28 degrees of freedom
## Multiple R-squared:  0.3889, Adjusted R-squared:  0.3671 
## F-statistic: 17.82 on 1 and 28 DF,  p-value: 0.0002315
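
The observations flagged by the studentized residuals can also be listed numerically; a minimal sketch, using the fitted model res from above:

which(abs(rstudent(res)) > 2)                                          # indices of observations outside the +/- 2 band

The following simulation replaces the independent noise by a strongly autocorrelated AR(1) error process (arima.sim() with ar = 0.95); the resulting run-order pattern and the slowly decaying autocorrelation function show up clearly in the diagnostic plots.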
set.seed(1234)
x <- seq(0,5,length=30)
y <- 3 + 5*x + 5*arima.sim(list(ar=c(0.95)),30)
res <- lm(y~x)
layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))
plot(x,y,type="n",pch=16, ylim=c(0,50), main="y~x")
text(x,y,label=1:length(y))
abline(lm(y~x)$coef)
grid()
plot(predict(res), rstudent(res),xlab=expression( paste("predicted response ", hat(y)) ) ,ylim=c(-3,3)   ,
     ylab="studentized residuals",type="n",
     main=expression(paste("(",frac(paste("y - ", hat(y)),sigma), ") ~ ", hat(y)   )) )
text(predict(res), rstudent(res),label=1:length(y))
abline(h=c(-2,0,2),lty=c(2,1,2))
grid()
plot(1:length(y), rstudent(res),xlab="run order",ylab="studentized residuals",type="l",ylim=c(-3,3)   ,
     main=expression(paste("(",frac(paste("y - ", hat(y)),sigma), ") ~ run order"   )))
abline(h=c(-2,0,2),lty=c(2,1,2))
grid()
acf(rstudent(res), main="Autocorrelation function of residuals")

summary(res)                                                       # summary from res
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3957  -5.2302  -0.2657   4.9561  14.1255 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   10.152      2.960   3.430  0.00189 **
## x              2.988      1.017   2.939  0.00653 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.31 on 28 degrees of freedom
## Multiple R-squared:  0.2357, Adjusted R-squared:  0.2084 
## F-statistic: 8.637 on 1 and 28 DF,  p-value: 0.006531

The summary() output again provides the essential information about the fitted model: the residuals have a minimum, first quartile, median, third quartile and maximum of -15.3957, -5.2302, -0.2657, 4.9561 and 14.1255, respectively, and the table lists the coefficient estimates together with their standard errors and t-tests.

For both coefficients the null hypothesis \(H_0: a_i=0\) is tested against the alternative \(H_1: a_i\ne 0\). Since both p-values (0.00189 and 0.00653) are smaller than 0.05, \(H_0\) is rejected for the intercept and for the slope; note, however, that the estimated slope (2.99) deviates considerably from the true value of 5 used in the simulation.
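
The acf plot indicates that the residuals are serially correlated, which violates the independence assumption of OLS. A minimal numeric check in base R is shown below; a formal test could be run with lmtest::dwtest(res), assuming the lmtest package is installed:

acf(rstudent(res), plot=FALSE)$acf[2]                                  # lag-1 autocorrelation of the studentized residuals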