1.Consider the following data set:

x <-  c(10, 3, 12, 9, 13, 11)
y <-  c(2.1, 11.3, 4.0, 1.1, 4.9, 2.8)
data <- data.frame(x,y)

Fit the data using simple linear regression and perform appropriate model diagnostics by checking the following. In each case, comment on what you are looking for and what you conclude based on your results. Include any plots you make as part of your submission.

(a) Check the linearity, constant variance, and normality assumptions graphically

Lineanrity

There seems to be a departure from linearity.When x=3 seems to be an outlier, hoever an appropiate test should be carried on to determine whether it is an outlier or not. The linear model seems to fit the dataset except for the x=3 datapoint.

plot(x,y, main = "Scatter plot", xlab = "X", ylab = "Y")

Normality and Constant Variance

To evaluate the assumption of constant variance it is useful to explore the raw residuals which are an estimate of the error.Two main plots are useful to look at the residuals and evaluate assumptions: a. QQplot of residuals it will allow us to evaluate for the normality assumption. b. Residual vrs fitted plots it is useful to evaluate constant variance and correctness of the model (an incorrect model would be trying to fit a straight line into curved data. To construct the mentioned plots above, a linear model should be fitted for the dataset.

model <- lm(y~x, data = data)
par(mfrow=c(1,2))
qqnorm(residuals(model), main = "QQplot of Residuals")
plot(residuals(model)~fitted(model), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)

The residuals vs fitted plot does not show a random scatter of points, and the QQ plot is relatively straight, so assumption of normaliity reasonably met but it appears that there is a deviation frin constant variance.

(b) Check for outliers

In the response variable by using both graphical checks and a formal test for outliers for any points you think warrant investigation. The first time you perform a formal outlier test, do it “the long way” by fitting the full data set and the reduced data set and calculating the T statistic based on ^ Ypred. Thereafter, you can use R to perform an outlier test if you would like.

Long Way Outlier test

  1. Delete the questionable point
  2. Fit the regression line to the reduced data set
  3. Use the regression line from the reduced data set to predict the Y -value that would correspond to x = in question, the x-value of the deleted observation.
  4. Test whether the questionable observation is within sampling error of it predicted value. We can think of this test as corresponding to Ho : Yobserved = Ypredicted versus Ha : Yobserved ??? Ypredicted. This results in a T-test
x_reduced <-   c(10,  12, 9, 13, 11)
y_reduced <-  c(2.1,  4.0, 1.1, 4.9, 2.8)
reduced <- data.frame(x_reduced,y_reduced)
model_reduced <- lm(y_reduced~x_reduced, data = reduced)
summary(model_reduced)
## 
## Call:
## lm(formula = y_reduced ~ x_reduced, data = reduced)
## 
## Residuals:
##     1     2     3     4     5 
##  0.07  0.07  0.02  0.02 -0.18 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.47000    0.41988  -17.79 0.000387 ***
## x_reduced    0.95000    0.03786   25.09 0.000139 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1197 on 3 degrees of freedom
## Multiple R-squared:  0.9953, Adjusted R-squared:  0.9937 
## F-statistic: 629.7 on 1 and 3 DF,  p-value: 0.0001388
pt(48.24,df=3,lower.tail = FALSE)
## [1] 9.807254e-06

Outkier test using R

The output below shows that in fact, the datapoint in question is an outlier. p-value corrected by bonferroni states that the datapoint is not within the range of sampling error.

library(car)
## Loading required package: carData
outlierTest(model) 
##   rstudent unadjusted p-value Bonferonni p
## 2 48.23504         1.9621e-05   0.00011772

(c) Check for inuential observation(s)

Using graphical checks, and by computing the leverage and Cook’s distance for each point. You can use R to compute the leverages and Cook’s distances, but be aware that you should know how to compute these “the long way” as well. In the following data table I show Raw Residuals, Externally studentized residualds, Fitted Values, Leverage and Cooks Distance.

myrawresids <- residuals(model) #raw residuals
myextresids <- rstudent(model) #externally studentized residuals
myfits <- fitted(model) #fitted values
myhats <- hatvalues(model) #leverage or hat values
mycooks <- cooks.distance(model) #Cook's distance
table1 <- data.frame(myrawresids,myextresids,myfits,myhats,mycooks)
names(table1) <- c("raw residuals","ext. stud. residuals","fitted values","leverage ","Cook's distance")
table1
Graphical checks for Influential observations
par(mfrow=c(1,2))
plot(myhats ~ c(1:6), main = "Index Plot of Leverage",
ylab = "Leverage",
xlab = "Observation Number") 
abline(h=1)

plot(mycooks ~ c(1:6), main = "Index Plot of Cook's Distance",
ylab = "Cook's Distance",
xlab = "Observation Number") 
abline(h=1)

2. indicate the degree of variability that can be present in plots of residuals

Based on a small amount of data even when there are no violations of the assumptions. You will use simulations to create regression data where the error terms are known to follow the usual assumptions.

(a) consider the following data

x: 2 4 6 8 10 12 14 16 18 20
y: 5 6 7 8 9 10 11 12 13 14

Note that these data perfectly follow the line: y = 4.0 + 0.5x. By simulation you will add some error to y. We will call this randomly simulated data Y* , thus producing data that follow the relationship Y* = 4.0 + 0.5x + e e where e~ N(0, 1).
If you repeat this process, you will generate a new set of Y* values. Do this for three different simulations, and generate residual plots (raw residuals vs Fitted values) and QQ plots of raw residuals for each simulation.

x1 <- c( 2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
y1 <- c( 5, 6 ,7, 8, 9, 10, 11, 12, 13, 14)
model_data <- data.frame(x1,y1)

ystar1 <-  y1 + rnorm(n=10, mean=0, sd=1)
ystar2 <-  y1 + rnorm(n=10, mean=0, sd=1)
ystar3  <-  y1 + rnorm(n=10, mean=0, sd=1)
linemodel <- lm(y1~x1, data = model_data)
model1 <- lm(ystar1~x1, data = model_data)
model2 <- lm(ystar2~x1, data = model_data)
model3 <- lm(ystar3~x1, data = model_data)
par(mfrow=c(1,1))
qqnorm(residuals(model1), main = "QQplot of Residuals #1a")

plot(residuals(model1)~fitted(model1), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values #1 a")
abline(h=0)

qqnorm(residuals(model2), main = "QQplot of Residuals #2a")

plot(residuals(model2)~fitted(model2), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values #2a ")
abline(h=0)

qqnorm(residuals(model3), main = "QQplot of Residuals #3a")

plot(residuals(model3)~fitted(model3), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values #3a ")
abline(h=0)

  1. Repeat part (a) using the following values of x and y:
x2 <- c(2, 7, 12, 14, 15, 16, 17, 18, 19, 20)
y2 <- c(5, 7.5, 10, 11, 11.5, 12, 12.5, 13, 13.5, 14) 
simulation <- data.frame(x2,y2)

ystarb1 <-  y2 + rnorm(n=10, mean=0, sd=1)
ystarb2 <-  y2 + rnorm(n=10, mean=0, sd=1)
ystarb3  <-  y2 + rnorm(n=10, mean=0, sd=1)
modelb1 <- lm(ystarb1~x2, data = simulation)
modelb2 <- lm(ystarb2~x2, data = simulation)
modelb3 <- lm(ystarb3~x2, data = simulation)

qqnorm(residuals(modelb1), main = "QQplot of Residuals #1b")

plot(residuals(modelb1)~fitted(modelb1), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values #1b ")
abline(h=0)

qqnorm(residuals(modelb2), main = "QQplot of Residuals #2b")

plot(residuals(modelb2)~fitted(modelb2), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values #2b ")
abline(h=0)

qqnorm(residuals(modelb3), main = "QQplot of Residuals #3b")

plot(residuals(modelb3)~fitted(modelb3), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values #3b")
abline(h=0)

3. Consider the following:

The growth of the mold Aspergillus niger is inhibited by the toxicant methyl 4-hydroxybenzoate. Various concentrations of the toxicant (x) are available. The inhibitions (y) are measured as growth rate changes (inmm) per 24 hours.

  1. Fit a simple linear regression model to the data Create a scatterplot, QQ plot, and raw residuals vs. fitted values plot for the fit. Comment on what you see.
par(mfrow=c(1,1))
concentration_x <- c(0.32, 0.40, 0.51, 0.64, 0.80, 1.0, 1.3, 1.6, 2.0, 2.5, 3.2, 4.0)
inhibitions_y <- c( 2.0, 2.3, 2.4, 2.8, 3.0, 3.3, 3.6, 4.3, 4.8, 5.1, 5.6, 6.6 )
aspergillus <- data.frame(concentration_x,inhibitions_y)
aspergillus_model <- lm(inhibitions_y~concentration_x, data = aspergillus)
plot(concentration_x, inhibitions_y, main = "Scatter plot of Aspergillus inhibition", xlab = "Concentration", ylab = "Growth inhibition (mm/day)")

qqnorm(residuals(aspergillus_model), main = "QQplot of Residuals Aspergillus Inhibition")

plot(residuals(aspergillus_model)~fitted(aspergillus_model), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values Aspergillus inhibition ")
abline(h=0)

  1. If you are not satisfied with your fit in (a) (You shouldn’t be), find an appropriate simple linear regression model for which the usual assumptions appear more justified. You might need to consider transformations of x or y, or both. Create a scatterplot, QQ plot, and raw residuals vs. fitted values plot for the model _t to the transformed data that you think looks the best. Comment on what you see.
#Log transformation on both x and y axis
log_concentration_x <- log(concentration_x)
log_inhibitions_y <- log(inhibitions_y)
log_aspergillus <- data.frame(log_concentration_x,log_inhibitions_y)
log_aspergillus_model <- lm(log_inhibitions_y~log_concentration_x, data = aspergillus)
plot(log_concentration_x, log_inhibitions_y, main = "Scatter plot ", xlab = " Log Concentration", ylab = "Log inhibition (mm/day)")

qqnorm(residuals(log_aspergillus_model), main = "QQplot of Residuals ")

plot(residuals(log_aspergillus_model)~fitted(log_aspergillus_model), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values  ")
abline(h=0)

#log transformation only on the y axis

concentration_x <- concentration_x
log_inhibitions_y <- log(inhibitions_y)
log_aspergillus <- data.frame(log_concentration_x,log_inhibitions_y)
log_aspergillus_model_1 <- lm(inhibitions_y~concentration_x, data = aspergillus)
plot(concentration_x, log_inhibitions_y, main = "Scatter plot", xlab = " Log Concentration", ylab = "Log Growth inhibition (mm/day)")

qqnorm(residuals(log_aspergillus_model_1), main = "QQplot of Residuals ")

plot(residuals(log_aspergillus_model_1)~fitted(log_aspergillus_model_1), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values ")
abline(h=0)

#square root transform of both axis (correct)

sqrt_concentration_x <- sqrt(concentration_x)
sqrt_inhibitions_y <- sqrt(inhibitions_y)
sqrt_aspergillus <- data.frame(sqrt_concentration_x,sqrt_inhibitions_y)
sqrt_aspergillus_model <- lm(sqrt_inhibitions_y~sqrt_concentration_x, data = aspergillus)
plot(sqrt_concentration_x, sqrt_inhibitions_y, main = "Scatter plot", xlab = " sqrt Concentration", ylab = "sqrt inhibition (mm/day)")

qqnorm(residuals(sqrt_aspergillus_model), main = "QQplot of Residuals")

plot(residuals(sqrt_aspergillus_model)~fitted(sqrt_aspergillus_model), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted")
abline(h=0)

#square root transform of only y axis

concentration_x <- concentration_x
sqrt_inhibitions_y <- sqrt(inhibitions_y)
sqrt_aspergillus <- data.frame(concentration_x,sqrt_inhibitions_y)
sqrt_aspergillus_model_ <- lm(sqrt_inhibitions_y~concentration_x, data = aspergillus)
plot(sqrt_concentration_x, sqrt_inhibitions_y, main = "Scatter plot of Aspergillus inhibition", xlab = "Concentration", ylab = "sqrt Growth inhibition (mm/day)")

qqnorm(residuals(sqrt_aspergillus_model_), main = "QQplot of Residuals Aspergillus Inhibition")

plot(residuals(sqrt_aspergillus_model_)~fitted(sqrt_aspergillus_model_), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values Aspergillus inhibition ")
abline(h=0)

myrawresid <- residuals(sqrt_aspergillus_model) #raw residuals
myextresid <- rstudent(sqrt_aspergillus_model) #externally studentized residuals
residuals <- data.frame(myrawresid,myextresid)
names(residuals) <- c("Raw Residuals", "Externally Studentized Residuals")
residuals
  1. A forester is interested in relating the crown diameter of a tree (in meters) to the stem area at breast height (in cm2). Observations on stem area x and the crown diameter y are given below.
 area <- c(530, 630, 840, 900, 1020, 1020, 1370, 1440, 1760, 2210, 2400, 2400, 2810, 3140, 3360  ) #x

crown <- c( 2.8, 3.3, 3.7, 3.7, 4.1, 3.9, 5.1, 5.7, 7.2, 8.0, 9.0, 9.6, 10.5, 12.4, 14.6)  
#y
forest <- data.frame(area,crown)
  1. Fit a simple linear regression model to the data. Create a scatterplot, QQ plot, and raw residuals vs. fitted values plot for the fit. Comment on what you see.
forest_mod <- lm(crown~area)
plot(area, crown, main = "Scatter plot", xlab = "area (cm2)", ylab = "crown (m)" )

qqnorm(residuals(forest_mod), main = "QQplot")

plot(residuals(forest_mod)~fitted(forest_mod), ylab = "Residual", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)

Scatter plot informs that the linearity relationship of the model is correct. Nevertheless, residuals vs fitted values does not show a random pattern what leads to the conclusion that the linearity assumption is not met. QQ-plot shows no obvious deviation from normality. As shown in the Scatter plot, the scale of the predictive variable is bigger by a factor of about thousand which could be interfering with the quality of the regression model. A transformation on the x axis could be help alleviate the assumptions that are not met.

#Log transformation of the x axis
log_forest_mod <- lm(crown~log(area))
plot(log(area), crown, main = "Scatter plot", xlab = "Log(area (cm2))", ylab = "crown (m)" )

qqnorm(residuals(log_forest_mod), main = "QQplot")

plot(residuals(log_forest_mod)~fitted(log_forest_mod), ylab = "Residual", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)

Residuals vs fitted values does not show a random pattern what leads to the conclusion that the linearity assumption is not met. QQ-plot shows no obvious deviation from normality and scatter plot also shows that the linearity is not met, which indeed is the most important assumption to comply with. This transformation is not useful.

#Log transformation of both axes
logforest_mod <- lm(log(crown)~log(area))
plot(log(area), log(crown), main = "Scatter plot", xlab = "Log(area (cm2))", ylab = "Log(crown (m))" )

qqnorm(residuals(logforest_mod), main = "QQplot")

plot(residuals(logforest_mod)~fitted(logforest_mod), ylab = "Residual", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)

#Log transformation of the y axis
log_forest_mod_y <- lm(log(crown)~area)
plot(area, log(crown), main = "Scatter plot", xlab = "area (cm2)", ylab = "Log(crown (m))" )

qqnorm(residuals(log_forest_mod_y), main = "QQplot")

plot(residuals(log_forest_mod_y)~fitted(log_forest_mod_y), ylab = "Residual", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)

A log transformation on the y axis shows as in the case above (transformation on both axes) that the linearity of the model is compromised. Both scatter plot and residuals vs. fitted values inform of a non-linearity. Normality seems to be not violated. There seems to be a “curve” as shown in the residuals vs. fitted values in the opposite direction (concave down) of thelog transformation only on the x axis (concave up). The transformation does not solve the problem of constant variance or the linearity.

#square root transformation of the x axis
sqrt_forest_mod <- lm(crown~sqrt(area))
plot(sqrt(area), crown, main = "Scatter plot", xlab = "Sqrt(area (cm2))", ylab = "crown (m)" )

qqnorm(residuals(sqrt_forest_mod), main = "QQplot")

plot(residuals(sqrt_forest_mod)~fitted(sqrt_forest_mod), ylab = "Residual", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)

#square root transformation of the y axis (Suitable transformation)
sqrt_forest_mod_y <- lm(sqrt(crown)~area)
plot(area, sqrt(crown), main = "Scatter plot", xlab = "area (cm2)", ylab = "sqrt(crown (m))" )

qqnorm(residuals(sqrt_forest_mod_y), main = "QQplot")

plot(residuals(sqrt_forest_mod_y)~fitted(sqrt_forest_mod_y), ylab = "Residual", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)

This is so far the best transformation. Normality shows no signs of obvious deviation, the scatter plot informs of a linear relationship between regressor and response variables and the residual vrs fitted value shows a random pattern in the distribution of the data points. This is the best transformation, because all the assumptions (Normality, constant variance are met. According to the hierarchy of assumptions, linearity and constant variance should be met at the same time. However, to be consistent below it is shown the square root transformation of both axes.

#square root transformation of both axes
sqrt_forest_mod_ <- lm(sqrt(crown)~sqrt(area))
plot(sqrt(area), sqrt(crown), main = "Scatter plot", xlab = "Sqrt(area (cm2))", ylab = "Sqrt(crown (m))" )

qqnorm(residuals(sqrt_forest_mod_), main = "QQplot")

plot(residuals(sqrt_forest_mod_)~fitted(sqrt_forest_mod_), ylab = "Residual", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)

A square root transformation on both x and y axes shows that the linearity of the model is compromised. Both scatter plot and residuals vs. fitted values inform of a non-linearity. Normality seems to be not violated. There seems to be a “curve”(concave up) as shown in the residuals vs. fitted values this is evidence that constant variance and linearity are not met. This transformation did not remedy the non-linearity problem that we are facing.

  1. For your chosen model in part (b), give a point estimate and create a 95% confidence interval for the predicted (single observation) crown diameter (in meters) when the stem area is 1900 cm2.
summary(sqrt_forest_mod_y)
## 
## Call:
## lm(formula = sqrt(crown) ~ area)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.08907 -0.03992 -0.01405  0.05232  0.11087 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.3034987  0.0355957   36.62 1.67e-14 ***
## area        0.0007210  0.0000183   39.40 6.50e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06415 on 13 degrees of freedom
## Multiple R-squared:  0.9917, Adjusted R-squared:  0.9911 
## F-statistic:  1553 on 1 and 13 DF,  p-value: 6.505e-15
predict(sqrt_forest_mod_y,newdata=data.frame(area=1900),interval="prediction")
##        fit      lwr      upr
## 1 2.673353 2.530047 2.816658