REGRESSION ANALYSIS USING R
We start b reading the regression file which details how weight gain (mg) of individual caterpillars (i.e., growth) declines as the tannin content of their diet (%) increases.
regression = read.csv("C:\\temp\\Course Data\\regression.csv",header = TRUE)
attach(regression)
names(regression)
## [1] "growth" "tannin"
plot(tannin,growth)
Next, we try to fit a horizontal line in this plot.
plot(tannin,growth)
abline(mean(growth),0)
The scatter around this line (defined by y bar) is called the total variation in y. Each line is at a distance = (y-y bar) from the line, and the total variation in y is defined as the sum of squares of all these distances (SST - difference b/w observed value of dependent variable and its mean).
Now, we try to find the best position of this line (location as well as slope) , i.e., the position with the least SST (minimizing the variation in y). For location, we know that for the best fit, this line should pass through the mean values of x and y, i.e., it should pass through (x bar, y bar). Then, we can rotate it to rest at the most optimal slope value.
How to rotate? Imagine that we have found the fitted line. Now, each predicted value of y (=y cap, derived from our regression equation: “y cap=a+bx”) will be at distance of e(=y-y cap) from the fitted line. The total sum of these distance is known as the SSE (Sum of Squares Errors - difference b/w observed and predicted values of dependent variable).
When we start rotating our horizontal line, SSE first declines as the slope of our line approaches that of the best-fit line, followed by an increase again as we move past the best fit line slope. Our objective, therefore, is to minimize SSE by finding the derivative of SSE w.r.t. b (slope), equate it to zero (since we are looking for the flattest part of this u-shaped curve) and find b at that point.
ANALYSIS OF VARIANCE IN REGRESSION
Goal = partition SST into SSR and SSE (SST = SSR + SSE). SSE is zero and SSR=SST when the fit of the line is perfect.SSR = 0 and SST=SSE when y is independent of x.
Variation in SSE/Error variance is known as MSE (Mean Square Errors) and is the quantity used in calculating standard errors and confidence intervals for the parameters, and in carrying out hypothesis testing.
95% CI for slope/b (=tStandard Error of b) with alpha=0.025 (p=1-alpha) and d.f.=n-2 95% CI for intercept/a (=tStandard Error of a) with alpha=0.025 and d.f.=n-2 t values can be calculated using the qt()
qt(0.975,7)
## [1] 2.364624
CALCULATIONS INVOLVED IN LINEAR REGRESSION
regression
## growth tannin
## 1 12 0
## 2 10 1
## 3 8 2
## 4 11 3
## 5 6 4
## 6 7 5
## 7 2 6
## 8 3 7
## 9 3 8
Response Variable = growth Explanatory variable = tannin
Calculating sigma x, sigma x sq, sigma y, sigma y sq, sigma xy
sum(tannin);sum(tannin^2)
## [1] 36
## [1] 204
sum(growth);sum(growth^2)
## [1] 62
## [1] 536
sum(tannin*growth)
## [1] 175
Calculating SST, SSX, SSXY using above values
SST = sum(growth^2)-(sum(growth)/9)
SSX = sum(tannin^2) -(sum(tannin)/9)
SSXY = sum(tannin*growth)-((sum(tannin)*sum(growth))/9)
b = SSXY/SSX
a = (sum(growth)/9)+b*(sum(tannin)/9)
SSR = b*SSXY
SSE = SST - SSR
qf(0.95,1,7)
## [1] 5.591448
1-pf(30.98,1,7)
## [1] 0.0008455934
SSR;SSE;SST
## [1] 26.645
## [1] 502.4661
## [1] 529.1111
Since the probability of getting F=30.98 is extremely low, i.e., 0.00084, we reject the null hyp that slope of the regression line is zero. Next we calculate the std errors of the slope and the intercept:
SEofA=(2.867/60)^(1/2)
SEofB=(2.867*204/(9*60))^(1/2)
SEofA;SEofB
## [1] 0.218594
## [1] 1.040716
Therefore, 95% CI for slope and intercept are given by: a = 11.756 +- 2.463 b = -1.21666 +- 0.5169
USING R for Regression - data inspection - model specification - model fitting - model criticism.
First step is to get to know the data. Here, we plot the response variable against the explanatory variable to check: - is there a trend in the data? - what is the slope of the trend (positive or negative)? - is the trend linear or curved? - is there any pattern to the scatter around the trend?
Below we use lm() - A typical model has the form (response ~ terms) where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response (explanatory variable).
plot(tannin,growth)
abline(lm(growth~tannin))
From a statistical modeling POV, we define and store this model in object named model. Then, we summarize it.
model=lm(growth~tannin)
summary(model)
##
## Call:
## lm(formula = growth ~ tannin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4556 -0.8889 -0.2389 0.9778 2.8944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.7556 1.0408 11.295 9.54e-06 ***
## tannin -1.2167 0.2186 -5.565 0.000846 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.693 on 7 degrees of freedom
## Multiple R-squared: 0.8157, Adjusted R-squared: 0.7893
## F-statistic: 30.97 on 1 and 7 DF, p-value: 0.0008461
“Call” = the model we fitted “Residuals” = summary ofn the residuals pertaining to this specific model “Coefficients” = here we find the parameter estimates, i.e., slope and intercept values as well as information on their statistical significance. The t-values and p-values are for tests of the null hypotheses that the intercept and slope are equal to zero (against 2-tailed alternatives that they are not equal to zero). Next, lets plot this model that we created above.
#par(mfrow=c(2,2))
plot(model)
Graph at (-1,1) - shows the residuals against the fitted values. It is good news because it shows no evidence if variance increasing for larger values of , and shows no evidence of curvature. Graph at (1,1) - shows the ordered residuals plotted against the quantiles of the standard normal distribution. If, as we hope, our errors really are normal, then this plot should be linear. The data are very well behaved with only point number 4 having a larger residual than expected. Graph at (-1,-1) - very like the first plot but shows the square root of the standardised residuals against the fitted values (useful for detecting non-constant variance). Graph at (1,-1) - shows Cooks distance. This is an influence measure that shows which of the outlying values might be expected to have the largest effects on the estimated parameter values. It is often a good idea to repeat the modelling exercise with the most influential points omitted in order to assess the magnitude of their impact on the structure of the model.
Predicting value of y(or growth) = y cap at x (or tannin)=5.5
predict(model,list(tannin=5.5))
## 1
## 5.063889
This is the same as putting x=5.5 in our regression equation and finding y.
TRANSFORMATION ON NON LINEAR RESPONSES
decay = read.csv("C:\\temp\\Course Data\\decay.csv", header = TRUE)
attach(decay)
names(decay)
## [1] "time" "amount"
plot(time,amount)
Above plot does not show a linear relationship. To visualize better, let’s plot a straight line through this.
plot(time,amount)
abline(lm(amount~time))
This shows groups of residuals. For 5>x>25, residuals are positive while those for 5<x<25 look negative. This is the “evidence of curvature” in this relationship, indicating the systematic inadequacy of this model. Let’s do a linear regression anyway, then look at what the model checking tells us. We call the model object “result” and fit the model as before:
result=lm(amount~time)
summary(result)
##
## Call:
## lm(formula = amount ~ time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.065 -10.029 -2.058 5.107 40.447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.5534 5.0277 16.82 < 2e-16 ***
## time -2.8272 0.2879 -9.82 9.94e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.34 on 29 degrees of freedom
## Multiple R-squared: 0.7688, Adjusted R-squared: 0.7608
## F-statistic: 96.44 on 1 and 29 DF, p-value: 9.939e-11
p-values are extremely small, so everything looks highly significant. But lets look at the disagnostic plots to see if the model is any good.
plot(result)
The residuals vs fitted graph (graph 1) shows bowing instead of a horizontal straight line. A U-shaped plot of residuals against fitted values means that the linear model is inadequate. We need to account for curvature in the relationship between y and x.
Graph 2 is banana-shaped, instead of being straight as per our expectations. We can see that the standardized residuals spike up drastically above quantiles of +0.5.
Doing a log transformation:
plot(time,amount,log="y")
transformed = lm(log(amount)~time)
summary(transformed)
##
## Call:
## lm(formula = log(amount) ~ time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5935 -0.2043 0.0067 0.2198 0.6297
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.547386 0.100295 45.34 < 2e-16 ***
## time -0.068528 0.005743 -11.93 1.04e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.286 on 29 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.825
## F-statistic: 142.4 on 1 and 29 DF, p-value: 1.038e-12
Like before, eveyrthing looks hihgly significant with the really low p-values. Examining diagnostic plots:
plot(transformed)
plot(time,amount)
smoothtime = seq(0,30,0.1)
smoothamount = exp(predict(transformed,list(time=smoothtime)))
We predict using the current model, which is called ‘transformed’. This takes values of “time” because that was the name of the explanatory variable we used in fitting the model log(amount)~time. We use list to coerce the “smoothtime” values into a vector that will be known internally as “time”, but which will not alter the values of our data called “time”. Finally, we back transform the predicted values to get our “smoothamount” values on the untransformed scale. Since the model was log(amount)~time, the appropriate back transformation is exp(predicted). Next, using lines() to draw the fitted model through the data:
plot(time,amount)
lines(smoothtime,smoothamount)