Applied Regression Analysis Week 4 Homework

Note: Basically all of this stuff can be done simply by looking at model summaries and ANOVA tables. However, it never hurts to recall the structure of lm objects.

Download and read data

url <- "https://d396qusza40orc.cloudfront.net/appliedregression/Homeworks/week4-HW-data.csv"
download.file(url, "conc_vs_time.csv", mode = "wb")
conc <- read.csv("conc_vs_time.csv")

Exercise One

Generate separate graphs of:

Concentration \((Y)\) vs. Time \((X)\)

library(ggplot2)
(plot1 <- ggplot(conc, aes(time, concen)) + geom_point() + theme_bw() + 
                       labs(title = "Plot 1", x = "Time", y = "Concentration"))

Natural Logarithm of Concentration \((lnY)\) vs. Time \((X)\)

(plot2 <- ggplot(conc, aes(time, lnconc)) + geom_point() + theme_bw() + 
                labs(title = "Plot 2", x = "Time", 
                     y = "Natural Logarithm of Concentration"))

Exercise Two

Preliminary work

f <- c("concen ~ time", "concen ~ time + I(time^2)", "lnconc ~ time")
form <- lapply(f, as.formula, env = new.env())
mod_list <- lapply(form, function(x) lm(x, data = conc))

(mycoefs <- sapply(mod_list, "[[", 1))

## [[1]]
## (Intercept)        time 
##  -1.9317968   0.2459714 
## 
## [[2]]
## (Intercept)        time   I(time^2) 
##  3.17205238 -0.78102262  0.04668155 
## 
## [[3]]
## (Intercept)        time 
##  -6.2095556   0.4511667

Using the output from exercise one, obtain the following:

The estimated equation of the straight-line (degree 1) regression of \(Y\) on \(X\)
\(\hat{y} = -1.93 + 0.25*time\)
The estimated equation of the quadratic (degree 2) regression of \(Y\) on \(X\)
\(\hat{y} = 3.17 - 0.78*time + 0.05*{time^2}\)
The estimated equation of the straight-line (degree 1) regression of \(lnY\) on \(X\)
\(\hat{y} = -6.21 + 0.45*time\)
Plots of each of these fitted equations on their respective scatter diagrams.

plot1 + geom_smooth(col="red", method = "lm", se = FALSE) + 
                geom_smooth(col="blue", method = "lm", 
                            formula = y ~ poly(x, 2, raw=T), se = FALSE) + 
                ggtitle("Plot 3")

plot2 + geom_smooth(col="red", method = "lm", se = FALSE) +
                ggtitle("Plot 4")

Exercise Three

Determine and compare the proportions of the total variation in \(Y\) explained by the straight-line regression on \(X\) and by the quadratic regression on \(X\)

# get R-squared from the model summaries:
mod_sum <- sapply(mod_list, summary)
unlist(mod_sum[8, -3])

## [1] 0.7319874 0.9569672

R-squared of the quadratic model is much bigger, just as expected.

Exercise Four

Overall F-tests

# make anova tables
antables <- lapply(mod_list, anova)
# extract sums of squares and DFs
sumsq <- lapply(antables, "[", c(1, 2))

anova2 <- sumsq[[2]] # just for future

# add up due regression SS in the 2nd model
sumsq[[2]] <- rbind(sumsq[[2]][1, ] + sumsq[[2]][2, ], sumsq[[2]][3, ])
# messy indexing halore!

Sums of squares:

## [[1]]
##           Df Sum Sq
## time       1 12.705
## Residuals 16  4.652
## 
## [[2]]
##           Df     Sum Sq
## time       2 16.6104752
## Residuals 15  0.7469385
## 
## [[3]]
##           Df Sum Sq
## time       1 42.746
## Residuals 16  0.150

Mean squares:

# get mean squares:
(meansq <- lapply(sumsq, function(x) x[2] / x[1]))

## [[1]]
##               Sum Sq
## time      12.7054082
## Residuals  0.2907504
## 
## [[2]]
##              Sum Sq
## time      8.3052376
## Residuals 0.0497959
## 
## [[3]]
##                 Sum Sq
## time      42.745785833
## Residuals  0.009394361

F-values:

( fvals <- sapply(meansq, function(x) x[1, ] / x[2, ]) )

## [1]   43.69869  166.78556 4550.15358

P-values:

pf(c(fvals[1], fvals[3]), 1, 16, lower.tail = FALSE)

## [1] 5.992679e-06 4.470250e-21

pf(fvals[2], 2, 15, lower.tail = FALSE)

## [1] 5.668901e-11

All three regression models are significant.

Test for the significance of the addition of \({x^2}\) to the model

(MS due to square term / residual MS) is our F-statistic:

(fstat <- 15 * anova2[2, 2] / anova2[3, 2])

## [1] 78.42145

Compare it to the \(F (1, n−1−k)\)

pf(fstat, 1, 15, lower.tail = FALSE)

## [1] 2.411187e-07

Yes, perfomance of our model has been significantly improved with addition of the square term.

Or we coud just get the ANOVA table

antables[[2]]

## Analysis of Variance Table
## 
## Response: concen
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## time       1 12.7054 12.7054 255.150 7.966e-11 ***
## I(time^2)  1  3.9051  3.9051  78.421 2.411e-07 ***
## Residuals 15  0.7469  0.0498                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and look at \(Pr(>F)\) of the second term.

Exercise Five

What proportion of the variation in \(lnY\) is explained by the straight-line regression of \(lnY\) on \(X\)?

unlist(mod_sum[8, 3])

## r.squared 
##  0.996496

This is a really good fit.

Compare this result with that obtained in Exercise Three for the quadratic regression of \(lnY\) on \(X\)

unlist(mod_sum[8, -1])

## [1] 0.9569672 0.9964960

And it’s better than the polynomial.
As we’ve seen in the Plot 4, quadratic regression line fits much better than the straight one, but still does not capture many points accurately. Logarithmic transformation actually can save a lot of trouble with non-normality, non-linearity and heteroscedasticity.

Exercise Six

A fundamental assumption in regression analysis is variance homoscedasticity. By examining the scatter diagrams constructed in Exercises One & Two, state why taking natural logarithms of the concentrations helps with regard to the assumption of variance homogeneity.

Let’s take a look at the histogram of concentration:

ggplot(conc, aes(x=concen)) + geom_histogram(aes(y = ..density..), binw = .2,
                                           col="black", fill = "yellowgreen") +
                theme_bw() + ggtitle("Plot 5")

And see what happens after the log transformation:

ggplot(conc, aes(x = lnconc)) + geom_histogram(aes(y = ..density..), binw = .2,
                                             col="black", fill="yellowgreen") + 
                theme_bw() + ggtitle("Plot 6")

shapiro.test(conc$lnconc)

## 
##  Shapiro-Wilk normality test
## 
## data:  conc$lnconc
## W = 0.93949, p-value = 0.284

Distribution of raw data has a distinct right skew. High values naturally tend to be more scattered than lower ones. Log transformation brings the data together, and variance is no longer associated with the mean. Maybe a shallow explanation, I appreciate if you can come up with something more strict and coherent.

Do you think the straight-line regression of \(lnY\) on \(X\) is better for describing this set of data than the quadratic regression of \(Y\) on \(X\)?

Quadratic model violates homoscedasticity assumption and fits a bit worse, so yes, I do.

Considering the overall table, what key assumption about the data would be in question if, instead of 18 different solutions, there were only 3 different solutions, each of which was analyzed at the six different time points?

Independency of observations. We’d have to switch to some different class of models.

Appendix

Full model summaries and ANOVA tables

lapply(mod_list, summary)

## [[1]]
## 
## Call:
## lm(formula = x, data = conc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6359 -0.3771 -0.1134  0.4775  1.0953 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.93180    0.42858  -4.507 0.000358 ***
## time         0.24597    0.03721   6.610 5.99e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5392 on 16 degrees of freedom
## Multiple R-squared:  0.732,  Adjusted R-squared:  0.7152 
## F-statistic:  43.7 on 1 and 16 DF,  p-value: 5.993e-06
## 
## 
## [[2]]
## 
## Call:
## lm(formula = x, data = conc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36732 -0.13822 -0.07342  0.16576  0.47283 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.172052   0.603016   5.260 9.60e-05 ***
## time        -0.781023   0.116989  -6.676 7.40e-06 ***
## I(time^2)    0.046682   0.005271   8.856 2.41e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2231 on 15 degrees of freedom
## Multiple R-squared:  0.957,  Adjusted R-squared:  0.9512 
## F-statistic: 166.8 on 2 and 15 DF,  p-value: 5.669e-11
## 
## 
## [[3]]
## 
## Call:
## lm(formula = x, data = conc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.161444 -0.080194  0.002056  0.061806  0.170222 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.209556   0.077038  -80.60   <2e-16 ***
## time         0.451167   0.006688   67.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09692 on 16 degrees of freedom
## Multiple R-squared:  0.9965, Adjusted R-squared:  0.9963 
## F-statistic:  4550 on 1 and 16 DF,  p-value: < 2.2e-16

antables

## [[1]]
## Analysis of Variance Table
## 
## Response: concen
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## time       1 12.705 12.7054  43.699 5.993e-06 ***
## Residuals 16  4.652  0.2908                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## [[2]]
## Analysis of Variance Table
## 
## Response: concen
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## time       1 12.7054 12.7054 255.150 7.966e-11 ***
## I(time^2)  1  3.9051  3.9051  78.421 2.411e-07 ***
## Residuals 15  0.7469  0.0498                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## [[3]]
## Analysis of Variance Table
## 
## Response: lnconc
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## time       1 42.746  42.746  4550.2 < 2.2e-16 ***
## Residuals 16  0.150   0.009                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1