(Amazon.csv)
amazon <- read.csv("Amazon.csv")
Tropical ecologists are interested in using tree diameter (in cm) to predict the age of certain trees growing in the Amazon rainforest. They measured tree age using 14C dating and diameter at breast height of 20 trees. (You do not need to go through the entire 6-step hypothesis testing procedure for this problem).
amazon %>%
ggplot(aes(x = diameter, y = age)) +
geom_point() +
geom_smooth(method = "lm", se = F, color = "black") +
labs(title = "Scatter plot of Age versus Tree Diameter",
x = "Tree diameter (cm)",
y = "Tree Age (14C-dating arbitrary units)")
amazon.lr <- lm(age ~ diameter, data = amazon)
summary(amazon.lr)
##
## Call:
## lm(formula = age ~ diameter, data = amazon)
##
## Residuals:
## Min 1Q Median 3Q Max
## -325.22 -140.67 -81.10 63.47 658.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.770 265.148 -0.071 0.9443
## diameter 4.392 1.948 2.254 0.0369 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 288.7 on 18 degrees of freedom
## Multiple R-squared: 0.2202, Adjusted R-squared: 0.1769
## F-statistic: 5.082 on 1 and 18 DF, p-value: 0.03687
amazon$residuals <- amazon.lr$residuals
# residuals regression
amazon %>%
ggplot() +
geom_point(aes(x = diameter, y = residuals)) +
geom_hline(aes(yintercept = 0)) +
theme_bw() +
labs(title = "Residual plots of Residuals versus X",
x = "Tree diameter (cm)",
y = "Residuals") # TODO: fix labels here
# qq plot
amazon.aov <- aov(age ~ diameter, data = amazon)
amazon.res <- data.frame(residuals = amazon.aov$residuals)
amazon.res %>%
ggplot(aes(sample = residuals)) +
geom_qq() +
stat_qqline() +
coord_flip() +
labs(title = "Normal probability (Q-Q) plot of residuals",
x = "Theoretical quantile",
y = "Residuals")
Briefly describe what these two plots tell you about the regression assumptions. Regression statistical analysis assumes linearity (y = a + bx), normality (values of Y are normally distributed for a value of X), constant variance (equal spread of data in positive and negative direction for each value of X) and random sampling.
The residual plot shows presence of linearity, non-constant variance and outliers.
The Q-Q plot shows that there are three outliers in the residual data set, affording evidence for non-normally distributed data. According to both plots, data violates regression assumptions. c. Repeated analysis on data with log-transformed tree age,
# plot of transformed data
amazon %>%
ggplot(aes(x = diameter, y = log(age))) +
geom_point() +
geom_smooth(method = "lm",
se = F,
color = "black") +
labs(title = "Scatter plot of Log of Tree Age versus Tree Diameter",
x = "Tree diameter (cm)",
y = "Log of tree age (14C dating units)")
d. Linear regression of with log(age), describing whether the transformed data or the original data are better at meeting the assumptions of linear regression analysis.
# re-do regresion
amazon$log.age <- log(amazon$age)
amazon.transformed.lr <- lm(log.age ~ diameter, data = amazon)
# residuals plots
# residuals regression
amazon$log.residuals <- amazon.transformed.lr$residual
plot(log.residuals ~ diameter, data = amazon,
pch = 19,
xlab = "Tree Diameter (cm)",
ylab = "Residuals")
abline(lm(log.residuals ~ diameter, data = amazon))
# qq plot
amazon.transformed.aov <- aov(log.age ~ diameter, data = amazon)
amazon.transformed.res <- data.frame(residuals = amazon.transformed.aov$residuals)
amazon.transformed.res %>%
ggplot(aes(sample = residuals)) +
geom_qq() +
stat_qqline() +
coord_flip() +
labs(title = "Normal probability (Q-Q) plot of residuals",
x = "Theoretical quantile",
y = "Residuals")
Briefly describe whether the transformed data or the original data are better at meeting the assumptions of linear regression analysis. Transformed data was better than original data in meeting the assumptions of the data
From residual plot, more even spread on either side of the line, tending better to a more constant variance than that of the original data. Also, data also shows linearity and lacks any prominent outliers.
From Q-Q plot, line is a better fit than that of the original data, showing no outliers and thus suggesting that the residuals are indeed normally distributed.
tsas <- read.csv("TreeSeedlingsAndSunflecks.csv")
The study is experimental, as the researchers had control over each tree’s length of exposure to fleeks of sunlight. That is, they had control over assignment of predictor variables to groups.
A bivariate scatterplot with a best-fit line,
tsas %>%
ggplot(aes(x = fleckDuration, y = growth)) +
geom_point() +
geom_smooth(method = "lm",
se = F,
color = "black") +
labs(x = "Mean fleck duration (min)", y = "Relative growth rate (mm/mm/week)")
Researchers were trying to see how the length of exposure to flecks of sunlight affects growth of seedlings of understory trees in mature tropical rainforest.
Response variable Y => relative growth rate (mm/mm/week)
Predictor variable X => Mean fleck duration (min)
Scatter plot shows linearity with no outliers or influential points. c. Equation for the best-fit line: y = -0.00171 + 0.00253x , where
Y => relative growth rate (mm/mm/week) &
X => Mean fleck duration (min)
tsas.lm <- lm(growth ~ fleckDuration, data = tsas)
tsas.lm
##
## Call:
## lm(formula = growth ~ fleckDuration, data = tsas)
##
## Coefficients:
## (Intercept) fleckDuration
## -0.001709 0.002529
findGrowth <- function(x) {
(x * 0.0025295) - 0.0017087
}
findGrowth(4)
## [1] 0.0084093
findGrowth(10)
## [1] 0.0235863
i) 4 min fleck duration => 0.008 mm/mm/week relative growth rate
ii) 10 min fleck duration => 0.024 mm/mm/week relative growth rate
summary(tsas.lm)
##
## Call:
## lm(formula = growth ~ fleckDuration, data = tsas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.007756 -0.003262 -0.000421 0.001991 0.011785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0017087 0.0024716 -0.691 0.498
## fleckDuration 0.0025295 0.0004483 5.642 1.93e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.004488 on 19 degrees of freedom
## Multiple R-squared: 0.6262, Adjusted R-squared: 0.6066
## F-statistic: 31.84 on 1 and 19 DF, p-value: 1.934e-05
Adjusted R^2 = 0.607
tsas$residuals <- tsas.lm$residuals
ggplot(tsas, aes(sample = residuals)) +
geom_qq() +
stat_qqline() +
coord_flip()
ggplot(tsas) +
geom_point(aes(x = fleckDuration, y = residuals)) +
geom_hline(aes(yintercept = 0)) +
theme_bw()
The above Q-Q plot shows an outlier that violates residual normality distribution assumption.
Residual plot also shows this outlier > 0.010. Additionally, though it shows linearity that affirms a regression assumption, it depicts non-constant variance that violates another. 3. Analysis and test statistic
tsas.lm <- lm(growth ~ fleckDuration, data = tsas)
summary(tsas.lm)
##
## Call:
## lm(formula = growth ~ fleckDuration, data = tsas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.007756 -0.003262 -0.000421 0.001991 0.011785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0017087 0.0024716 -0.691 0.498
## fleckDuration 0.0025295 0.0004483 5.642 1.93e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.004488 on 19 degrees of freedom
## Multiple R-squared: 0.6262, Adjusted R-squared: 0.6066
## F-statistic: 31.84 on 1 and 19 DF, p-value: 1.934e-05
F statistic = 31.84 &
in form y = a + bx, the regression equation is y = - 0.00171 + (0.00253)x , where
Y => relative growth rate (mm/mm/week) &
X => Mean fleck duration (min) 4. p-value < 0.0001 5. Since p-value < 0.05, we reject the null hypothesis 6. Growth (mm/mm/week) of seedlings of understory trees in mature tropical rainforests increased significantly with and is linearly dependent on intermittent of flecks of sunlight (min) (Linear regression, F(1,19) = 31.84, p < 0.0001).
confint(tsas.lm)
## 2.5 % 97.5 %
## (Intercept) -0.006881674 0.003464357
## fleckDuration 0.001591156 0.003467768
0.0016, 0.0035 (mm/mm/week per min) OR
0.0253 +/- 0.0009 (mm/mm/week per min), in form estimated slope +/- 95% CI
Data from text on estimating cadaver birth dates.