The data represents results from an environmental engineering study of a certain chemical reaction. The concentrations of 18 separately prepared solutions were recorded at different times (three measurements at each of six times). The natural logarithms of the concentrations were also computed.
data = read.csv("week4-HW-data.csv", header = T, sep = ",", row.names = 1)
attach(data)
library(ggplot2)
library(gridExtra)
library(ggthemes)
p1 = ggplot(data, aes(time, concen)) + geom_point() +
ggtitle("Concentration (Y) vs. Time (X)") +
theme_stata() +
scale_color_stata() +
xlab("Time") +
ylab("Concentration")
p2 = ggplot(data, aes(time, lnconc)) + geom_point() +
ggtitle("Concentration (lnY) vs. Time (X)") +
theme_stata() +
scale_color_stata() +
xlab("Time") +
ylab("Concentration (natural logarithms)")
marrangeGrob(list(p1,p2), nrow = 1, ncol = 2, top = " ")
Using the output from exercise one, obtain the following:
\[\hat y = -1.932 + 0.246\times time.\]
y1 = lm( concen ~ time); y1
##
## Call:
## lm(formula = concen ~ time)
##
## Coefficients:
## (Intercept) time
## -1.932 0.246
\[\hat y = 3.172 - 0.781\times time + 0.047 \times time^2.\]
y2 = lm( concen ~ time + I(time ^ 2)); y2
##
## Call:
## lm(formula = concen ~ time + I(time^2))
##
## Coefficients:
## (Intercept) time I(time^2)
## 3.17205 -0.78102 0.04668
\[\hat y = -6.210 + 0.451\times time.\]
y3 = lm( lnconc ~ time); y3
##
## Call:
## lm(formula = lnconc ~ time)
##
## Coefficients:
## (Intercept) time
## -6.2096 0.4512
p1 = ggplot(data, aes(time, concen)) + geom_point() +
ggtitle("Concentration (Y) vs. Time (X)") +
theme_stata() +
scale_color_stata() +
geom_smooth(method = "lm",
se = F,
colour = "red") +
geom_smooth(method = "lm",
se = F,
formula = y ~ poly(x,2),
colour = "blue")
p2 = ggplot(data, aes(time, lnconc)) + geom_point() +
ggtitle("Concentration (lnY) vs. Time (X)") +
theme_stata() +
scale_color_stata() +
geom_abline(intercept = coef(y3)[1], slope = coef(y3)[2],
size = 1.1, colour = "red")
marrangeGrob(list(p1, p2), nrow = 1, ncol = 2, top = "")
Determine and compare the proportions of the total variation in \(Y\) explained by the straight-line regression on \(X\) and by the quadratic regression on \(X\).
summary(y1)$r.squared
## [1] 0.7319874
library(car)
1/vif(y2) # higher than 0.01
## time I(time^2)
## 0.01732539 0.01732539
summary(y2)$r.squared
## [1] 0.9569672
summary(y1)$fstatistic
## value numdf dendf
## 43.69869 1.00000 16.00000
p = pf(43.69869, 1, 16, lower.tail = F); p
## [1] 5.992676e-06
If we take a confidence level of 5%, we can reject the null hypothesis (\(\beta_1 = 0\)), because the probability of error in rejecting the true null hypothesis would be 0% (less than 5% previously admitted).
summary(y2)$fstatistic
## value numdf dendf
## 166.7856 2.0000 15.0000
p = pf(166.7856, 2, 15, lower.tail = F); p
## [1] 5.668892e-11
If we take a confidence level of 5%, we can reject the null hypothesis (\(\beta_1 = \beta_2 = 0\)), because the probability of error in rejecting the true null hypothesis would be 0% (less than 5% previously admitted).
summary(y2)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.17205238 0.603016295 5.260310 9.603463e-05
## time -0.78102262 0.116989158 -6.676026 7.403165e-06
## I(time^2) 0.04668155 0.005271422 8.855589 2.411187e-07
anova(y1,y2)
## Analysis of Variance Table
##
## Model 1: concen ~ time
## Model 2: concen ~ time + I(time^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 16 4.6520
## 2 15 0.7469 1 3.9051 78.421 2.411e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we take a confidence level of 5%, we can reject the null hypothesis (\(\beta_2 = 0\)), because the probability of error in rejecting the true null hypothesis would be 0% (less than 5% previously admitted). So, we conclude that \(X^2\) can explain some variation of \(Y\).
summary(y3)$fstatistic
## value numdf dendf
## 4550.154 1.000 16.000
p = pf(4550.154, 1, 16, lower.tail = F); p
## [1] 4.470246e-21
If we take a confidence level of 5%, we can reject the null hypothesis (\(\beta_1 = 0\)), because the probability of error in rejecting the true null hypothesis would be 0% (less than 5% previously admitted).
summary(y3)$r.squared
## [1] 0.996496
As we can expect of the plots, the straight-line regression of \(lnY\) on \(X\) looks fits better than the quadratic.
Use the homework forum to explain your thoughts on the following:
Taking natural logarithms of the concentrations helps with regard to the assumption of variance homogeneity because in this case the variance is taken proporcionally (percentual).
Analysing if the straight-line regression of the \(lnY\) on \(X\) fits better than of the quadratic regression of the Y on X:
Plotting of the residuals for quadratic regression of the \(Y\) on \(X\):
par(mfrow = c(3,2))
plot(y2, which = c(1:6) )
par(mfrow = c(1,1))
Tests:
library(car)
library(lmtest)
# Normality of the errors: H_0: residuals come from normal population
shapiro.test(resid(y2))
##
## Shapiro-Wilk normality test
##
## data: resid(y2)
## W = 0.91423, p-value = 0.1022
# Independence of the errors: H_0: rho = 0
durbinWatsonTest(y2)
## lag Autocorrelation D-W Statistic p-value
## 1 0.2265476 1.222294 0.014
## Alternative hypothesis: rho != 0
# Tests for homoscedasticity errors: H_0: equal variances
bptest(y2)
##
## studentized Breusch-Pagan test
##
## data: y2
## BP = 5.3469, df = 2, p-value = 0.06901
# Outliers:
outlierTest(y2)
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 18 3.133142 0.0073333 0.132
dfbeta(y2) # change in coefficients when the i^th point is deleted in the model
## (Intercept) time I(time^2)
## 1 -0.214065183 0.035489754 -0.0014083236
## 2 -0.209393052 0.034715164 -0.0013775859
## 3 -0.217179938 0.036006148 -0.0014288154
## 4 0.009330769 0.003443498 -0.0002777015
## 5 0.008940849 0.003299599 -0.0002660967
## 6 0.009832095 0.003628511 -0.0002926219
## 7 -0.093566149 0.021750026 -0.0010259446
## 8 -0.083653106 0.019445678 -0.0009172490
## 9 -0.105957453 0.024630461 -0.0011618142
## 10 0.071639337 -0.015012655 0.0006584498
## 11 0.101943685 -0.021363199 0.0009369824
## 12 0.036900207 -0.007732764 0.0003391563
## 13 0.075762372 -0.013480242 0.0004265899
## 14 0.108149640 -0.019242841 0.0006089507
## 15 0.040725237 -0.007246157 0.0002293088
## 16 0.155368852 -0.035795765 0.0019040301
## 17 -0.134745902 0.031044399 -0.0016512978
## 18 0.395319672 -0.091078552 0.0048446038
dffits(y2) # change in Y when the i^th point is deleted in the model
## 1 2 3 4 5 6
## -0.43646151 -0.42660320 -0.44304664 0.26409611 0.25260884 0.27895859
## 7 8 9 10 11 12
## 0.26726002 0.23808369 0.30420916 -0.16973449 -0.24335743 -0.08696337
## 13 14 15 16 17 18
## -0.41830209 -0.63426954 -0.21657466 0.59910601 -0.51525975 1.92388476
Plotting of the residuals for straight-line regression of the lnY on X:
par(mfrow = c(3,2))
plot(y3, which = c(1:6) )
par(mfrow = c(1,1))
Tests:
library(car)
library(lmtest)
# Normality of the errors: H_0: residuals come from normal population
shapiro.test(resid(y3))
##
## Shapiro-Wilk normality test
##
## data: resid(y3)
## W = 0.97257, p-value = 0.8443
# Independence of the errors: H_0: rho = 0
durbinWatsonTest(y3)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.3989993 2.689829 0.252
## Alternative hypothesis: rho != 0
# Tests for homoscedasticity errors: H_0: equal variances
bptest(y3)
##
## studentized Breusch-Pagan test
##
## data: y3
## BP = 0.037195, df = 1, p-value = 0.8471
# Outliers:
outlierTest(y3)
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 6 2.01977 0.061643 NA
dfbeta(y3) # change in coefficients when the i^th point is deleted in the model
## (Intercept) time
## 1 -0.0144017094 1.080128e-03
## 2 0.0232905983 -1.746795e-03
## 3 -0.0420940171 3.157051e-03
## 4 0.0146791862 -9.859155e-04
## 5 -0.0072609546 4.876761e-04
## 6 0.0401580595 -2.697183e-03
## 7 -0.0012762763 5.630631e-05
## 8 -0.0119587087 5.275901e-04
## 9 0.0107845345 -4.757883e-04
## 10 -0.0002042042 -3.063063e-04
## 11 -0.0005454204 -8.181306e-04
## 12 0.0001505255 2.257883e-04
## 13 -0.0017151800 2.411972e-04
## 14 0.0097777778 -1.375000e-03
## 15 -0.0129827856 1.825704e-03
## 16 -0.0062222222 7.179487e-04
## 17 0.0267777778 -3.089744e-03
## 18 -0.0304722222 3.516026e-03
dffits(y3) # change in Y when the i^th point is deleted in the model
## 1 2 3 4 5 6
## -0.19044548 0.31087388 -0.58231255 0.21943674 -0.10735697 0.66730378
## 7 8 9 10 11 12
## -0.02902283 -0.28289653 0.25316303 -0.15989674 -0.46678051 0.11715692
## 13 14 15 16 17 18
## 0.05295652 -0.31037940 0.42169136 0.12618500 -0.56861753 0.65701695
If we take a confidence level of 5%, in straight-line regression of the \(lnY\) on \(X\) we haven’t significative evidence for reject the null hypothesis (\(H_0\)) for normality, independence and homoscedasticity of the residuals (the probability of error in rejecting the true null hypothesis would be higher than 5% previously admitted). Also, the plotting of residuals and tests for outliers doesn’t show us problem with extreme points. This is better than of results for quadratic regression of the \(Y\) on \(X\) that show possible problem with independence of residuals (maybe a regression linear is not good, it produce outliers).
detach(data)