1. Residual plots & transformations in linear regression

(Amazon.csv)

amazon <- read.csv("Amazon.csv")

Tropical ecologists are interested in using tree diameter (in cm) to predict the age of certain trees growing in the Amazon rainforest. They measured tree age using 14C dating and diameter at breast height of 20 trees. (You do not need to go through the entire 6-step hypothesis testing procedure for this problem).

  1. Use R to generate a bivariate scatterplot, and fit a regression line to the data.
amazon %>% 
  ggplot(aes(x = diameter, y = age)) +
    geom_point() + 
    geom_smooth(method = "lm", se = F, color = "black") + 
    labs(title = "Scatter plot of Age versus Tree Diameter", 
         x = "Tree diameter (cm)", 
         y = "Tree Age (14C-dating arbitrary units)")

  1. Use R to conduct a linear regression to calculate & save the residuals: make a residuals plot and a normal probability plot (Q–Q plot) of the residuals.
amazon.lr <- lm(age ~ diameter, data = amazon)
summary(amazon.lr)
## 
## Call:
## lm(formula = age ~ diameter, data = amazon)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -325.22 -140.67  -81.10   63.47  658.71 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -18.770    265.148  -0.071   0.9443  
## diameter       4.392      1.948   2.254   0.0369 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 288.7 on 18 degrees of freedom
## Multiple R-squared:  0.2202, Adjusted R-squared:  0.1769 
## F-statistic: 5.082 on 1 and 18 DF,  p-value: 0.03687
amazon$residuals <- amazon.lr$residuals
# residuals regression
amazon %>% 
  ggplot() +
    geom_point(aes(x = diameter, y = residuals)) +
    geom_hline(aes(yintercept = 0)) +
    theme_bw() +
    labs(title = "Residual plots of Residuals versus X", 
         x = "Tree diameter (cm)",
         y = "Residuals")  # TODO: fix labels here

# qq plot
amazon.aov <- aov(age ~ diameter, data = amazon)
amazon.res <- data.frame(residuals = amazon.aov$residuals)
amazon.res %>% 
  ggplot(aes(sample = residuals)) +
    geom_qq() + 
    stat_qqline() +
    coord_flip() +
    labs(title = "Normal probability (Q-Q) plot of residuals",
         x = "Theoretical quantile",
         y = "Residuals")

Briefly describe what these two plots tell you about the regression assumptions. Regression statistical analysis assumes linearity (y = a + bx), normality (values of Y are normally distributed for a value of X), constant variance (equal spread of data in positive and negative direction for each value of X) and random sampling.
The residual plot shows presence of linearity, non-constant variance and outliers.
The Q-Q plot shows that there are three outliers in the residual data set, affording evidence for non-normally distributed data. According to both plots, data violates regression assumptions. c. Repeated analysis on data with log-transformed tree age,

# plot of transformed data
amazon %>% 
  ggplot(aes(x = diameter, y = log(age))) +
    geom_point() + 
    geom_smooth(method = "lm",
                se = F,
                color = "black") +
    labs(title = "Scatter plot of Log of Tree Age versus Tree Diameter",
         x = "Tree diameter (cm)",
         y = "Log of tree age (14C dating units)")

d. Linear regression of with log(age), describing whether the transformed data or the original data are better at meeting the assumptions of linear regression analysis.

# re-do regresion
amazon$log.age <- log(amazon$age)
amazon.transformed.lr <- lm(log.age ~ diameter, data = amazon)
# residuals plots
# residuals regression
amazon$log.residuals <- amazon.transformed.lr$residual
plot(log.residuals ~ diameter, data = amazon, 
     pch = 19, 
     xlab = "Tree Diameter (cm)", 
     ylab = "Residuals") 
abline(lm(log.residuals ~ diameter, data = amazon))

# qq plot
amazon.transformed.aov <- aov(log.age ~ diameter, data = amazon)
amazon.transformed.res <- data.frame(residuals = amazon.transformed.aov$residuals)
amazon.transformed.res %>% 
  ggplot(aes(sample = residuals)) +
    geom_qq() + 
    stat_qqline() +
    coord_flip() +
    labs(title = "Normal probability (Q-Q) plot of residuals",
         x = "Theoretical quantile",
         y = "Residuals")

Briefly describe whether the transformed data or the original data are better at meeting the assumptions of linear regression analysis. Transformed data was better than original data in meeting the assumptions of the data
From residual plot, more even spread on either side of the line, tending better to a more constant variance than that of the original data. Also, data also shows linearity and lacks any prominent outliers.
From Q-Q plot, line is a better fit than that of the original data, showing no outliers and thus suggesting that the residuals are indeed normally distributed.

Assignment Problems from Whitlock & Schulter Chapter 17:

23. A simple linear regression

tsas <- read.csv("TreeSeedlingsAndSunflecks.csv")
  1. The study is experimental, as the researchers had control over each tree’s length of exposure to fleeks of sunlight. That is, they had control over assignment of predictor variables to groups.

  2. A bivariate scatterplot with a best-fit line,

tsas %>% 
  ggplot(aes(x = fleckDuration, y = growth)) +
    geom_point() + 
    geom_smooth(method = "lm",
                se = F,
                color = "black") + 
    labs(x = "Mean fleck duration (min)", y = "Relative growth rate (mm/mm/week)")

Researchers were trying to see how the length of exposure to flecks of sunlight affects growth of seedlings of understory trees in mature tropical rainforest.
Response variable Y => relative growth rate (mm/mm/week)
Predictor variable X => Mean fleck duration (min)

Scatter plot shows linearity with no outliers or influential points. c. Equation for the best-fit line: y = -0.00171 + 0.00253x , where
Y => relative growth rate (mm/mm/week) &
X => Mean fleck duration (min)

tsas.lm <- lm(growth ~ fleckDuration, data = tsas)
tsas.lm
## 
## Call:
## lm(formula = growth ~ fleckDuration, data = tsas)
## 
## Coefficients:
##   (Intercept)  fleckDuration  
##     -0.001709       0.002529
  1. Use your regression equation to find the predicted relative growth rate for tree seedlings exposed to
findGrowth <- function(x) {
  (x * 0.0025295) - 0.0017087
}
findGrowth(4)
## [1] 0.0084093
findGrowth(10)
## [1] 0.0235863
  i) 4 min fleck duration => 0.008 mm/mm/week relative growth rate
    
  ii) 10 min fleck duration => 0.024 mm/mm/week relative growth rate
  
  1. The proportion of the variation in Y is explained by X is given by adjusted R squared,
summary(tsas.lm)
## 
## Call:
## lm(formula = growth ~ fleckDuration, data = tsas)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.007756 -0.003262 -0.000421  0.001991  0.011785 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.0017087  0.0024716  -0.691    0.498    
## fleckDuration  0.0025295  0.0004483   5.642 1.93e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.004488 on 19 degrees of freedom
## Multiple R-squared:  0.6262, Adjusted R-squared:  0.6066 
## F-statistic: 31.84 on 1 and 19 DF,  p-value: 1.934e-05

Adjusted R^2 = 0.607

  1. Testing the hypothesis that there is a linear dependence of Y on X.
    1. Stating the statistical hypothesis,
      Ho : Beta, B = 0
      Ha : Beta, B != 0
    2. We choose a regression analysis for two reasons, => We are looking for the relationship between two quantitative variables &
      => We want to use how length of exposure to flecks of sunlight, X explains growth, Y.

      Assumptions,
      => each X, Y value represents random samples
      => there is a true linear relationship between X and Y, governed by the equation: Y = alpha + (Beta)X
      => At each value of X, the population distribution of Y is normal (ie, residuals are normally distributed).
      => At each value of X, the variance of Y-values is equal (ie, residuals have equal variance across all X).
      checks for assumption,
tsas$residuals <- tsas.lm$residuals
ggplot(tsas, aes(sample = residuals)) +
  geom_qq() +
  stat_qqline() +
  coord_flip()

ggplot(tsas) +
  geom_point(aes(x = fleckDuration, y = residuals)) +
  geom_hline(aes(yintercept = 0)) +
  theme_bw()

The above Q-Q plot shows an outlier that violates residual normality distribution assumption.
Residual plot also shows this outlier > 0.010. Additionally, though it shows linearity that affirms a regression assumption, it depicts non-constant variance that violates another. 3. Analysis and test statistic

tsas.lm <- lm(growth ~ fleckDuration, data = tsas)
summary(tsas.lm)
## 
## Call:
## lm(formula = growth ~ fleckDuration, data = tsas)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.007756 -0.003262 -0.000421  0.001991  0.011785 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.0017087  0.0024716  -0.691    0.498    
## fleckDuration  0.0025295  0.0004483   5.642 1.93e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.004488 on 19 degrees of freedom
## Multiple R-squared:  0.6262, Adjusted R-squared:  0.6066 
## F-statistic: 31.84 on 1 and 19 DF,  p-value: 1.934e-05

F statistic = 31.84 &
in form y = a + bx, the regression equation is y = - 0.00171 + (0.00253)x , where
Y => relative growth rate (mm/mm/week) &
X => Mean fleck duration (min) 4. p-value < 0.0001 5. Since p-value < 0.05, we reject the null hypothesis 6. Growth (mm/mm/week) of seedlings of understory trees in mature tropical rainforests increased significantly with and is linearly dependent on intermittent of flecks of sunlight (min) (Linear regression, F(1,19) = 31.84, p < 0.0001).

  1. The 95% CI for your estimated slope,
confint(tsas.lm)
##                      2.5 %      97.5 %
## (Intercept)   -0.006881674 0.003464357
## fleckDuration  0.001591156 0.003467768

0.0016, 0.0035 (mm/mm/week per min) OR
0.0253 +/- 0.0009 (mm/mm/week per min), in form estimated slope +/- 95% CI

30. Interpreting regression plots:

Data from text on estimating cadaver birth dates.

  1. Label the regression line (draw/write on your printed homework) Solid line (label a) => least-squares regression line, predicting the actual year of birth from the estimate based on amount of 14C
  2. Label the confidence bands—what do these confidence bands tell us?
    narrower => 95% confidence interval bands (label b), predicting the mean Y for a given X
    wider => 95% prediction interval (labels c), predicting a single Y for a given X