Linear regression by direct calculations

Using a chemical procedure called differential pulse polarography, a chemist measured the peak current generated (in microamperes, \(\mu\)A) when solutions containing different amounts of nickel (measured in parts per billion, ppb) were added to different portions of the same buffer.

First, let us input the response and explanatory variables and create a scatter plot to see whether there is a dependence:

y = c(0.095, 0.174, 0.256, 0.348, 0.429, 0.500, 0.580, 0.651, 0.722)
x = c(19.1, 38.2, 57.3, 76.2, 95, 114, 131, 150, 170)
plot(x, y, ylab = "peak current", xlab = "amounts of nickel")

Now let us calculate the estimate of \(\beta_1\) via \(\hat{\beta}_1 = S_{xy}/S_{xx}\):

n = length(x)
Sxx = sum((x - mean(x))^2)
Sxy = sum((x - mean(x))*(y - mean(y)))
beta1hat = Sxy/Sxx
print(Sxx)
## [1] 21066.82
print(Sxy)
## [1] 88.80003
print(beta1hat)
## [1] 0.004215161
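
For completeness, the estimate of the intercept follows from the relation \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\):

beta0hat = mean(y) - beta1hat * mean(x)
print(beta0hat)
## [1] 0.01874902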

Now, suppose we want to test the null hypothesis that \(\beta_1 = 0\).

In order to do the test we need to estimate \(\sigma\). For this purpose we calculate the residual sum of squares SSE. (We also illustrate another formula for it: \(\text{SSE} = S_{yy} - S_{xy}^2/S_{xx}\).)

SSE = sum((y - mean(y) - beta1hat *(x - mean(x)))^2)
SSE
## [1] 0.0004911382
Syy = sum((y - mean(y))^2)
SSE = Syy - Sxy^2/Sxx
SSE
## [1] 0.0004911382
sigma_hat = sqrt(SSE/(n - 2))
sigma_hat
## [1] 0.008376312
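
With \(\hat{\sigma}\) at hand we can, for instance, also form a \(95\%\) confidence interval for \(\beta_1\), namely \(\hat{\beta}_1 \pm t_{0.975,\,n-2}\,\hat{\sigma}/\sqrt{S_{xx}}\) (a quick sketch; the resulting interval is approximately \((0.00408, 0.00435)\)):

se_beta1 = sigma_hat/sqrt(Sxx)  # standard error of beta1hat
beta1hat + c(-1, 1) * qt(0.975, n - 2) * se_beta1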

Now we can construct the test statistic \(T = (\hat{\beta}_1 - 0)/(\hat{\sigma}/\sqrt{S_{xx}})\), which under the null hypothesis follows a \(t\)-distribution with \(n - 2\) degrees of freedom:

T = (beta1hat - 0)/(sigma_hat/sqrt(Sxx))
T
## [1] 73.04001

Here it is obvious that the null hypothesis can be rejected. In a less clear-cut case we would calculate the p-value using the cumulative distribution function of the test statistic:

2* pt(T,n-2,lower.tail = F)
## [1] 2.371798e-11
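
Equivalently, we could compare \(|T|\) with the critical value of the \(t\)-distribution with \(n - 2\) degrees of freedom:

qt(0.975, n - 2)  # about 2.36; |T| = 73.04 exceeds it by far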

Linear regression via a call to the “lm” function

All these calculations are, however, already preprogrammed in R.

model.lm = lm(y~x)
smry = summary(model.lm)
print(smry) 
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0133264 -0.0042777 -0.0000231  0.0080557  0.0098107 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.875e-02  6.129e-03   3.059   0.0183 *  
## x           4.215e-03  5.771e-05  73.040 2.37e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.008376 on 7 degrees of freedom
## Multiple R-squared:  0.9987, Adjusted R-squared:  0.9985 
## F-statistic:  5335 on 1 and 7 DF,  p-value: 2.372e-11
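
It is also instructive to redraw the scatter plot and overlay the fitted line:

plot(x, y, ylab = "peak current", xlab = "amounts of nickel")
abline(model.lm)  # adds the fitted regression line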

Returning to the summary, you can see many familiar numbers in the R output. For example, \(\hat{\sigma}\) is reported under the name “residual standard error”, so we could calculate SSE as the square of this number multiplied by \(n - 2\). Alternatively, it is useful to know that we can extract the fitted values directly from the fitted model:

SSE_check = sum((y - fitted(model.lm))^2) 
SSE
## [1] 0.0004911382
SSE_check
## [1] 0.0004911382
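
The same number can be recovered from the residual standard error stored in the summary object:

smry$sigma^2 * (n - 2)
## [1] 0.0004911382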

You can also see that R calculated the t-statistics and p-values needed to test the hypothesis that a parameter is equal to zero. If the p-value is less than \(5\%\), R marks the parameter with one star; if it is smaller than \(1\%\), the parameter gets two stars; our slope parameter got three stars, indicating that it is very highly significant (p-value below \(0.1\%\)).

(Note that if you wanted to test a one-sided hypothesis, you would need to divide the p-value by 2, provided the estimate falls on the side of the alternative.)
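
For example, a one-sided test of \(H_1 : \beta_1 > 0\) here would be

pt(T, n - 2, lower.tail = F)  # half of the two-sided p-value, about 1.19e-11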

If we want to test whether \(\beta_1\) is equal to some other value, for example \(0.004\), we can either extract the coefficients and standard errors from the regression output and calculate the statistic and p-value by hand:

z = smry$coefficients
z
##                Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) 0.018749024 0.0061285265  3.059304 1.834140e-02
## x           0.004215161 0.0000577103 73.040008 2.371798e-11
T = (z[2,1] - 0.004)/z[2,2]
T
## [1] 3.728291
2*pt(T, n-2, lower.tail = F)
## [1] 0.007375271

Alternatively, we can re-parameterize the model and run the linear regression again. With the response \(y - 0.004x\), the slope parameter becomes \(\beta_1 - 0.004\), so testing that it equals zero is equivalent to testing \(\beta_1 = 0.004\) in the original model:

summary(lm((y - 0.004*x ~ x)))
## 
## Call:
## lm(formula = (y - 0.004 * x ~ x))
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0133264 -0.0042777 -0.0000231  0.0080557  0.0098107 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1.875e-02  6.129e-03   3.059  0.01834 * 
## x           2.152e-04  5.771e-05   3.728  0.00738 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.008376 on 7 degrees of freedom
## Multiple R-squared:  0.6651, Adjusted R-squared:  0.6172 
## F-statistic:  13.9 on 1 and 7 DF,  p-value: 0.007375

Note that this gives the same p-value, \(0.00738\), as the direct calculation.
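
Consistent with this, the value \(0.004\) falls outside the \(95\%\) confidence interval for the slope, which R computes with the confint function:

confint(model.lm)  # the interval for x is roughly (0.00408, 0.00435) and excludes 0.004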