Homework4 - Applied Regression Analysis

The Data

The data represents results from an environmental engineering study of a certain chemical reaction. The concentrations of 18 separately prepared solutions were recorded at different times (three measurements at each of six times). The natural logarithms of the concentrations were also computed.

data = read.csv("week4-HW-data.csv", header = T, sep = ",", row.names = 1)
attach(data)

Exercise One:

Generate separate graphs of:

Concentration (Y) vs. Time (X)
Natural Logarithm of Concentration (lnY) vs. Time (X)

Solving

library(ggplot2)
library(gridExtra)
library(ggthemes)
p1 = ggplot(data, aes(time, concen)) + geom_point() + 
        ggtitle("Concentration (Y) vs. Time (X)") + 
        theme_stata() + 
        scale_color_stata() +
        xlab("Time") +
        ylab("Concentration")
p2 = ggplot(data, aes(time, lnconc)) + geom_point() + 
        ggtitle("Concentration (lnY) vs. Time (X)") + 
        theme_stata() + 
        scale_color_stata() +
        xlab("Time") +
        ylab("Concentration (natural logarithms)")
marrangeGrob(list(p1,p2), nrow = 1, ncol = 2, top = " ")

Exercise Two:

Equations and Plotting:

Using the output from exercise one, obtain the following:

The estimated equation of the straight-line (degree 1) regression of \(Y\) on \(X\).
The estimated equation of the quadratic (degree 2) regression of \(Y\) on \(X\).
The estimated equation of the straight-line (degree 1) regression of \(lnY\) on \(X\).
Plots of each of these fitted equations on their respective scatter diagrams.

Solving

Equation of the Straight-line (degree 1) regression of \(Y\) on \(X\):

\[\hat y = -1.932 + 0.246\times time.\]

y1 = lm( concen ~ time); y1

## 
## Call:
## lm(formula = concen ~ time)
## 
## Coefficients:
## (Intercept)         time  
##      -1.932        0.246

Equation of the quadratic (degree 2) regression of \(Y\) on \(X\):

\[\hat y = 3.172 - 0.781\times time + 0.047 \times time^2.\]

y2 = lm( concen ~ time + I(time ^ 2)); y2

## 
## Call:
## lm(formula = concen ~ time + I(time^2))
## 
## Coefficients:
## (Intercept)         time    I(time^2)  
##     3.17205     -0.78102      0.04668

Equation of the Straight-line (degree 1) regression of \(lnY\) on \(X\):

\[\hat y = -6.210 + 0.451\times time.\]

y3 = lm( lnconc ~ time); y3

## 
## Call:
## lm(formula = lnconc ~ time)
## 
## Coefficients:
## (Intercept)         time  
##     -6.2096       0.4512

p1 = ggplot(data, aes(time, concen)) + geom_point() + 
        ggtitle("Concentration (Y) vs. Time (X)") + 
        theme_stata() + 
        scale_color_stata() +
        geom_smooth(method = "lm", 
                    se = F, 
                    colour = "red") +
        geom_smooth(method = "lm", 
                    se = F, 
                    formula = y ~ poly(x,2),
                    colour = "blue")
p2 = ggplot(data, aes(time, lnconc)) + geom_point() + 
        ggtitle("Concentration (lnY) vs. Time (X)") + 
        theme_stata() + 
        scale_color_stata() +
        geom_abline(intercept = coef(y3)[1], slope = coef(y3)[2], 
                    size = 1.1, colour = "red")
marrangeGrob(list(p1, p2), nrow = 1, ncol = 2, top = "")

Exercise Three:

Determine and Compare

Determine and compare the proportions of the total variation in \(Y\) explained by the straight-line regression on \(X\) and by the quadratic regression on \(X\).

Solving

We have 73.2 % of the total variation in \(Y\) explained by the straight-line regression on \(X\).

summary(y1)$r.squared

## [1] 0.7319874

We have 95.7 % of the total variation in \(Y\) explained by the quadratic regression on \(X\). As we can expect of the plots, the quadratic line looks fits better than the straight line.

library(car)
1/vif(y2) # higher than 0.01

##       time  I(time^2) 
## 0.01732539 0.01732539

summary(y2)$r.squared

## [1] 0.9569672

Exercise Four

F-Tests

Carry out F-tests for the significance of the straight-line regression of \(Y\) on \(X\).
Carry out an overall F-test for the significance of the quadratic regression of \(Y\) on \(X\) and a test for the significance of the addition of \(x^2\) to the model.
For the straight-line regression of \(lnY\) on \(X\), carry out F-tests for the significance of the overall regression.

Solving

F-tests for the significance of the straight-line regression of \(Y\) on \(X\):

summary(y1)$fstatistic

##    value    numdf    dendf 
## 43.69869  1.00000 16.00000

p = pf(43.69869, 1, 16, lower.tail = F); p

## [1] 5.992676e-06

If we take a confidence level of 5%, we can reject the null hypothesis (\(\beta_1 = 0\)), because the probability of error in rejecting the true null hypothesis would be 0% (less than 5% previously admitted).

Overall F-test for the significance of the quadratic regression of \(Y\) on \(X\):

summary(y2)$fstatistic

##    value    numdf    dendf 
## 166.7856   2.0000  15.0000

p = pf(166.7856, 2, 15, lower.tail = F); p

## [1] 5.668892e-11

If we take a confidence level of 5%, we can reject the null hypothesis (\(\beta_1 = \beta_2 = 0\)), because the probability of error in rejecting the true null hypothesis would be 0% (less than 5% previously admitted).

F-test for the significance of the addition of \(x^2\) to the model:

summary(y2)$coefficients

##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)  3.17205238 0.603016295  5.260310 9.603463e-05
## time        -0.78102262 0.116989158 -6.676026 7.403165e-06
## I(time^2)    0.04668155 0.005271422  8.855589 2.411187e-07

anova(y1,y2)

## Analysis of Variance Table
## 
## Model 1: concen ~ time
## Model 2: concen ~ time + I(time^2)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     16 4.6520                                  
## 2     15 0.7469  1    3.9051 78.421 2.411e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If we take a confidence level of 5%, we can reject the null hypothesis (\(\beta_2 = 0\)), because the probability of error in rejecting the true null hypothesis would be 0% (less than 5% previously admitted). So, we conclude that \(X^2\) can explain some variation of \(Y\).

Overall F-test for the significance of the straight-line regression of \(lnY\) on \(X\):

summary(y3)$fstatistic

##    value    numdf    dendf 
## 4550.154    1.000   16.000

p = pf(4550.154, 1, 16, lower.tail = F); p

## [1] 4.470246e-21

Exercise Five

Determine and Compare

What proportion of the variation in \(lnY\) is explained by the straight-line regression of \(lnY\) on \(X\)?
Compare this result with that obtained in Exercise Three for the quadratic regression of \(Y\) on \(X\). Discuss this in the homework forum.

Solving

We have 99.65% of the total variation in \(lnY\) explained by the straight-line regression on \(X\).

summary(y3)$r.squared

## [1] 0.996496

For result in Exercise Three, we have 95.7 % of the total variation in \(Y\) explained by the quadratic regression on \(X\).

As we can expect of the plots, the straight-line regression of \(lnY\) on \(X\) looks fits better than the quadratic.

Exercise Six

Examine and Discuss

Use the homework forum to explain your thoughts on the following:

A fundamental assumption in regression analysis is variance homoscedasticity. By examining the scatter diagrams constructed in Exercises One & Two, state why taking natural logarithms of the concentrations helps with regard to the assumption of variance homogeneity.
Do you think the straight-line regression of lnY on X is better for describing this set of data than the quadratic regression of Y on X? Share your thoughts in the homework forum.
Considering the overall table, what key assumption about the data would be in question if, instead of 18 different solutions, there were only 3 different solutions, each of which was analyzed at the six different time points?

Solving

Taking natural logarithms of the concentrations helps with regard to the assumption of variance homogeneity because in this case the variance is taken proporcionally (percentual).
Analysing if the straight-line regression of the \(lnY\) on \(X\) fits better than of the quadratic regression of the Y on X:

Plotting of the residuals for quadratic regression of the \(Y\) on \(X\):

par(mfrow = c(3,2))
plot(y2, which =  c(1:6) )

par(mfrow = c(1,1))

Tests:

library(car)
library(lmtest)
# Normality of the errors: H_0: residuals come from normal population
shapiro.test(resid(y2))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(y2)
## W = 0.91423, p-value = 0.1022

# Independence of the errors: H_0: rho = 0
durbinWatsonTest(y2)

##  lag Autocorrelation D-W Statistic p-value
##    1       0.2265476      1.222294   0.014
##  Alternative hypothesis: rho != 0

# Tests for homoscedasticity errors:  H_0: equal variances
bptest(y2)

## 
##  studentized Breusch-Pagan test
## 
## data:  y2
## BP = 5.3469, df = 2, p-value = 0.06901

# Outliers:
outlierTest(y2)

## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##    rstudent unadjusted p-value Bonferonni p
## 18 3.133142          0.0073333        0.132

dfbeta(y2) # change in coefficients when the i^th point is deleted in the model

##     (Intercept)         time     I(time^2)
## 1  -0.214065183  0.035489754 -0.0014083236
## 2  -0.209393052  0.034715164 -0.0013775859
## 3  -0.217179938  0.036006148 -0.0014288154
## 4   0.009330769  0.003443498 -0.0002777015
## 5   0.008940849  0.003299599 -0.0002660967
## 6   0.009832095  0.003628511 -0.0002926219
## 7  -0.093566149  0.021750026 -0.0010259446
## 8  -0.083653106  0.019445678 -0.0009172490
## 9  -0.105957453  0.024630461 -0.0011618142
## 10  0.071639337 -0.015012655  0.0006584498
## 11  0.101943685 -0.021363199  0.0009369824
## 12  0.036900207 -0.007732764  0.0003391563
## 13  0.075762372 -0.013480242  0.0004265899
## 14  0.108149640 -0.019242841  0.0006089507
## 15  0.040725237 -0.007246157  0.0002293088
## 16  0.155368852 -0.035795765  0.0019040301
## 17 -0.134745902  0.031044399 -0.0016512978
## 18  0.395319672 -0.091078552  0.0048446038

dffits(y2) # change in Y when the i^th point is deleted in the model

##           1           2           3           4           5           6 
## -0.43646151 -0.42660320 -0.44304664  0.26409611  0.25260884  0.27895859 
##           7           8           9          10          11          12 
##  0.26726002  0.23808369  0.30420916 -0.16973449 -0.24335743 -0.08696337 
##          13          14          15          16          17          18 
## -0.41830209 -0.63426954 -0.21657466  0.59910601 -0.51525975  1.92388476

Plotting of the residuals for straight-line regression of the lnY on X:

par(mfrow = c(3,2))
plot(y3, which =  c(1:6) )

par(mfrow = c(1,1))

Tests:

library(car)
library(lmtest)
# Normality of the errors: H_0: residuals come from normal population
shapiro.test(resid(y3))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(y3)
## W = 0.97257, p-value = 0.8443

# Independence of the errors: H_0: rho = 0
durbinWatsonTest(y3)

##  lag Autocorrelation D-W Statistic p-value
##    1      -0.3989993      2.689829   0.252
##  Alternative hypothesis: rho != 0

# Tests for homoscedasticity errors:  H_0: equal variances
bptest(y3)

## 
##  studentized Breusch-Pagan test
## 
## data:  y3
## BP = 0.037195, df = 1, p-value = 0.8471

# Outliers:
outlierTest(y3)

## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##   rstudent unadjusted p-value Bonferonni p
## 6  2.01977           0.061643           NA

dfbeta(y3) # change in coefficients when the i^th point is deleted in the model

##      (Intercept)          time
## 1  -0.0144017094  1.080128e-03
## 2   0.0232905983 -1.746795e-03
## 3  -0.0420940171  3.157051e-03
## 4   0.0146791862 -9.859155e-04
## 5  -0.0072609546  4.876761e-04
## 6   0.0401580595 -2.697183e-03
## 7  -0.0012762763  5.630631e-05
## 8  -0.0119587087  5.275901e-04
## 9   0.0107845345 -4.757883e-04
## 10 -0.0002042042 -3.063063e-04
## 11 -0.0005454204 -8.181306e-04
## 12  0.0001505255  2.257883e-04
## 13 -0.0017151800  2.411972e-04
## 14  0.0097777778 -1.375000e-03
## 15 -0.0129827856  1.825704e-03
## 16 -0.0062222222  7.179487e-04
## 17  0.0267777778 -3.089744e-03
## 18 -0.0304722222  3.516026e-03

dffits(y3) # change in Y when the i^th point is deleted in the model

##           1           2           3           4           5           6 
## -0.19044548  0.31087388 -0.58231255  0.21943674 -0.10735697  0.66730378 
##           7           8           9          10          11          12 
## -0.02902283 -0.28289653  0.25316303 -0.15989674 -0.46678051  0.11715692 
##          13          14          15          16          17          18 
##  0.05295652 -0.31037940  0.42169136  0.12618500 -0.56861753  0.65701695

If we take a confidence level of 5%, in straight-line regression of the \(lnY\) on \(X\) we haven’t significative evidence for reject the null hypothesis (\(H_0\)) for normality, independence and homoscedasticity of the residuals (the probability of error in rejecting the true null hypothesis would be higher than 5% previously admitted). Also, the plotting of residuals and tests for outliers doesn’t show us problem with extreme points. This is better than of results for quadratic regression of the \(Y\) on \(X\) that show possible problem with independence of residuals (maybe a regression linear is not good, it produce outliers).

Independence.

detach(data)