x <- c(10, 3, 12, 9, 13, 11)
y <- c(2.1, 11.3, 4.0, 1.1, 4.9, 2.8)
data <- data.frame(x,y)
Fit the data using simple linear regression and perform appropriate model diagnostics by checking the following. In each case, comment on what you are looking for and what you conclude based on your results. Include any plots you make as part of your submission.
There seems to be a departure from linearity.When x=3 seems to be an outlier, hoever an appropiate test should be carried on to determine whether it is an outlier or not. The linear model seems to fit the dataset except for the x=3 datapoint.
plot(x,y, main = "Scatter plot", xlab = "X", ylab = "Y")
To evaluate the assumption of constant variance it is useful to explore the raw residuals which are an estimate of the error.Two main plots are useful to look at the residuals and evaluate assumptions: a. QQplot of residuals it will allow us to evaluate for the normality assumption. b. Residual vrs fitted plots it is useful to evaluate constant variance and correctness of the model (an incorrect model would be trying to fit a straight line into curved data. To construct the mentioned plots above, a linear model should be fitted for the dataset.
model <- lm(y~x, data = data)
par(mfrow=c(1,2))
qqnorm(residuals(model), main = "QQplot of Residuals")
plot(residuals(model)~fitted(model), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)
The residuals vs fitted plot does not show a random scatter of points, and the QQ plot is relatively straight, so assumption of normaliity reasonably met but it appears that there is a deviation frin constant variance.
In the response variable by using both graphical checks and a formal test for outliers for any points you think warrant investigation. The first time you perform a formal outlier test, do it “the long way” by fitting the full data set and the reduced data set and calculating the T statistic based on ^ Ypred. Thereafter, you can use R to perform an outlier test if you would like.
x_reduced <- c(10, 12, 9, 13, 11)
y_reduced <- c(2.1, 4.0, 1.1, 4.9, 2.8)
reduced <- data.frame(x_reduced,y_reduced)
model_reduced <- lm(y_reduced~x_reduced, data = reduced)
summary(model_reduced)
##
## Call:
## lm(formula = y_reduced ~ x_reduced, data = reduced)
##
## Residuals:
## 1 2 3 4 5
## 0.07 0.07 0.02 0.02 -0.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.47000 0.41988 -17.79 0.000387 ***
## x_reduced 0.95000 0.03786 25.09 0.000139 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1197 on 3 degrees of freedom
## Multiple R-squared: 0.9953, Adjusted R-squared: 0.9937
## F-statistic: 629.7 on 1 and 3 DF, p-value: 0.0001388
pt(48.24,df=3,lower.tail = FALSE)
## [1] 9.807254e-06
The output below shows that in fact, the datapoint in question is an outlier. p-value corrected by bonferroni states that the datapoint is not within the range of sampling error.
library(car)
## Loading required package: carData
outlierTest(model)
## rstudent unadjusted p-value Bonferonni p
## 2 48.23504 1.9621e-05 0.00011772
Using graphical checks, and by computing the leverage and Cook’s distance for each point. You can use R to compute the leverages and Cook’s distances, but be aware that you should know how to compute these “the long way” as well. In the following data table I show Raw Residuals, Externally studentized residualds, Fitted Values, Leverage and Cooks Distance.
myrawresids <- residuals(model) #raw residuals
myextresids <- rstudent(model) #externally studentized residuals
myfits <- fitted(model) #fitted values
myhats <- hatvalues(model) #leverage or hat values
mycooks <- cooks.distance(model) #Cook's distance
table1 <- data.frame(myrawresids,myextresids,myfits,myhats,mycooks)
names(table1) <- c("raw residuals","ext. stud. residuals","fitted values","leverage ","Cook's distance")
table1
par(mfrow=c(1,2))
plot(myhats ~ c(1:6), main = "Index Plot of Leverage",
ylab = "Leverage",
xlab = "Observation Number")
abline(h=1)
plot(mycooks ~ c(1:6), main = "Index Plot of Cook's Distance",
ylab = "Cook's Distance",
xlab = "Observation Number")
abline(h=1)
Based on a small amount of data even when there are no violations of the assumptions. You will use simulations to create regression data where the error terms are known to follow the usual assumptions.
x: 2 4 6 8 10 12 14 16 18 20
y: 5 6 7 8 9 10 11 12 13 14
Note that these data perfectly follow the line: y = 4.0 + 0.5x. By simulation you will add some error to y. We will call this randomly simulated data Y* , thus producing data that follow the relationship Y* = 4.0 + 0.5x + e e where e~ N(0, 1).
If you repeat this process, you will generate a new set of Y* values. Do this for three different simulations, and generate residual plots (raw residuals vs Fitted values) and QQ plots of raw residuals for each simulation.
x1 <- c( 2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
y1 <- c( 5, 6 ,7, 8, 9, 10, 11, 12, 13, 14)
model_data <- data.frame(x1,y1)
ystar1 <- y1 + rnorm(n=10, mean=0, sd=1)
ystar2 <- y1 + rnorm(n=10, mean=0, sd=1)
ystar3 <- y1 + rnorm(n=10, mean=0, sd=1)
linemodel <- lm(y1~x1, data = model_data)
model1 <- lm(ystar1~x1, data = model_data)
model2 <- lm(ystar2~x1, data = model_data)
model3 <- lm(ystar3~x1, data = model_data)
par(mfrow=c(1,1))
qqnorm(residuals(model1), main = "QQplot of Residuals #1a")
plot(residuals(model1)~fitted(model1), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values #1 a")
abline(h=0)
qqnorm(residuals(model2), main = "QQplot of Residuals #2a")
plot(residuals(model2)~fitted(model2), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values #2a ")
abline(h=0)
qqnorm(residuals(model3), main = "QQplot of Residuals #3a")
plot(residuals(model3)~fitted(model3), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values #3a ")
abline(h=0)
x2 <- c(2, 7, 12, 14, 15, 16, 17, 18, 19, 20)
y2 <- c(5, 7.5, 10, 11, 11.5, 12, 12.5, 13, 13.5, 14)
simulation <- data.frame(x2,y2)
ystarb1 <- y2 + rnorm(n=10, mean=0, sd=1)
ystarb2 <- y2 + rnorm(n=10, mean=0, sd=1)
ystarb3 <- y2 + rnorm(n=10, mean=0, sd=1)
modelb1 <- lm(ystarb1~x2, data = simulation)
modelb2 <- lm(ystarb2~x2, data = simulation)
modelb3 <- lm(ystarb3~x2, data = simulation)
qqnorm(residuals(modelb1), main = "QQplot of Residuals #1b")
plot(residuals(modelb1)~fitted(modelb1), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values #1b ")
abline(h=0)
qqnorm(residuals(modelb2), main = "QQplot of Residuals #2b")
plot(residuals(modelb2)~fitted(modelb2), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values #2b ")
abline(h=0)
qqnorm(residuals(modelb3), main = "QQplot of Residuals #3b")
plot(residuals(modelb3)~fitted(modelb3), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values #3b")
abline(h=0)
The growth of the mold Aspergillus niger is inhibited by the toxicant methyl 4-hydroxybenzoate. Various concentrations of the toxicant (x) are available. The inhibitions (y) are measured as growth rate changes (inmm) per 24 hours.
par(mfrow=c(1,1))
concentration_x <- c(0.32, 0.40, 0.51, 0.64, 0.80, 1.0, 1.3, 1.6, 2.0, 2.5, 3.2, 4.0)
inhibitions_y <- c( 2.0, 2.3, 2.4, 2.8, 3.0, 3.3, 3.6, 4.3, 4.8, 5.1, 5.6, 6.6 )
aspergillus <- data.frame(concentration_x,inhibitions_y)
aspergillus_model <- lm(inhibitions_y~concentration_x, data = aspergillus)
plot(concentration_x, inhibitions_y, main = "Scatter plot of Aspergillus inhibition", xlab = "Concentration", ylab = "Growth inhibition (mm/day)")
qqnorm(residuals(aspergillus_model), main = "QQplot of Residuals Aspergillus Inhibition")
plot(residuals(aspergillus_model)~fitted(aspergillus_model), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values Aspergillus inhibition ")
abline(h=0)
#Log transformation on both x and y axis
log_concentration_x <- log(concentration_x)
log_inhibitions_y <- log(inhibitions_y)
log_aspergillus <- data.frame(log_concentration_x,log_inhibitions_y)
log_aspergillus_model <- lm(log_inhibitions_y~log_concentration_x, data = aspergillus)
plot(log_concentration_x, log_inhibitions_y, main = "Scatter plot ", xlab = " Log Concentration", ylab = "Log inhibition (mm/day)")
qqnorm(residuals(log_aspergillus_model), main = "QQplot of Residuals ")
plot(residuals(log_aspergillus_model)~fitted(log_aspergillus_model), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values ")
abline(h=0)
#log transformation only on the y axis
concentration_x <- concentration_x
log_inhibitions_y <- log(inhibitions_y)
log_aspergillus <- data.frame(log_concentration_x,log_inhibitions_y)
log_aspergillus_model_1 <- lm(inhibitions_y~concentration_x, data = aspergillus)
plot(concentration_x, log_inhibitions_y, main = "Scatter plot", xlab = " Log Concentration", ylab = "Log Growth inhibition (mm/day)")
qqnorm(residuals(log_aspergillus_model_1), main = "QQplot of Residuals ")
plot(residuals(log_aspergillus_model_1)~fitted(log_aspergillus_model_1), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values ")
abline(h=0)
#square root transform of both axis (correct)
sqrt_concentration_x <- sqrt(concentration_x)
sqrt_inhibitions_y <- sqrt(inhibitions_y)
sqrt_aspergillus <- data.frame(sqrt_concentration_x,sqrt_inhibitions_y)
sqrt_aspergillus_model <- lm(sqrt_inhibitions_y~sqrt_concentration_x, data = aspergillus)
plot(sqrt_concentration_x, sqrt_inhibitions_y, main = "Scatter plot", xlab = " sqrt Concentration", ylab = "sqrt inhibition (mm/day)")
qqnorm(residuals(sqrt_aspergillus_model), main = "QQplot of Residuals")
plot(residuals(sqrt_aspergillus_model)~fitted(sqrt_aspergillus_model), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted")
abline(h=0)
#square root transform of only y axis
concentration_x <- concentration_x
sqrt_inhibitions_y <- sqrt(inhibitions_y)
sqrt_aspergillus <- data.frame(concentration_x,sqrt_inhibitions_y)
sqrt_aspergillus_model_ <- lm(sqrt_inhibitions_y~concentration_x, data = aspergillus)
plot(sqrt_concentration_x, sqrt_inhibitions_y, main = "Scatter plot of Aspergillus inhibition", xlab = "Concentration", ylab = "sqrt Growth inhibition (mm/day)")
qqnorm(residuals(sqrt_aspergillus_model_), main = "QQplot of Residuals Aspergillus Inhibition")
plot(residuals(sqrt_aspergillus_model_)~fitted(sqrt_aspergillus_model_), ylab = "Residuals", xlab= "Fitted Values", main= "Residuals vrs Fitted values Aspergillus inhibition ")
abline(h=0)
myrawresid <- residuals(sqrt_aspergillus_model) #raw residuals
myextresid <- rstudent(sqrt_aspergillus_model) #externally studentized residuals
residuals <- data.frame(myrawresid,myextresid)
names(residuals) <- c("Raw Residuals", "Externally Studentized Residuals")
residuals
area <- c(530, 630, 840, 900, 1020, 1020, 1370, 1440, 1760, 2210, 2400, 2400, 2810, 3140, 3360 ) #x
crown <- c( 2.8, 3.3, 3.7, 3.7, 4.1, 3.9, 5.1, 5.7, 7.2, 8.0, 9.0, 9.6, 10.5, 12.4, 14.6)
#y
forest <- data.frame(area,crown)
forest_mod <- lm(crown~area)
plot(area, crown, main = "Scatter plot", xlab = "area (cm2)", ylab = "crown (m)" )
qqnorm(residuals(forest_mod), main = "QQplot")
plot(residuals(forest_mod)~fitted(forest_mod), ylab = "Residual", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)
Scatter plot informs that the linearity relationship of the model is correct. Nevertheless, residuals vs fitted values does not show a random pattern what leads to the conclusion that the linearity assumption is not met. QQ-plot shows no obvious deviation from normality. As shown in the Scatter plot, the scale of the predictive variable is bigger by a factor of about thousand which could be interfering with the quality of the regression model. A transformation on the x axis could be help alleviate the assumptions that are not met.
#Log transformation of the x axis
log_forest_mod <- lm(crown~log(area))
plot(log(area), crown, main = "Scatter plot", xlab = "Log(area (cm2))", ylab = "crown (m)" )
qqnorm(residuals(log_forest_mod), main = "QQplot")
plot(residuals(log_forest_mod)~fitted(log_forest_mod), ylab = "Residual", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)
Residuals vs fitted values does not show a random pattern what leads to the conclusion that the linearity assumption is not met. QQ-plot shows no obvious deviation from normality and scatter plot also shows that the linearity is not met, which indeed is the most important assumption to comply with. This transformation is not useful.
#Log transformation of both axes
logforest_mod <- lm(log(crown)~log(area))
plot(log(area), log(crown), main = "Scatter plot", xlab = "Log(area (cm2))", ylab = "Log(crown (m))" )
qqnorm(residuals(logforest_mod), main = "QQplot")
plot(residuals(logforest_mod)~fitted(logforest_mod), ylab = "Residual", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)
#Log transformation of the y axis
log_forest_mod_y <- lm(log(crown)~area)
plot(area, log(crown), main = "Scatter plot", xlab = "area (cm2)", ylab = "Log(crown (m))" )
qqnorm(residuals(log_forest_mod_y), main = "QQplot")
plot(residuals(log_forest_mod_y)~fitted(log_forest_mod_y), ylab = "Residual", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)
A log transformation on the y axis shows as in the case above (transformation on both axes) that the linearity of the model is compromised. Both scatter plot and residuals vs. fitted values inform of a non-linearity. Normality seems to be not violated. There seems to be a “curve” as shown in the residuals vs. fitted values in the opposite direction (concave down) of thelog transformation only on the x axis (concave up). The transformation does not solve the problem of constant variance or the linearity.
#square root transformation of the x axis
sqrt_forest_mod <- lm(crown~sqrt(area))
plot(sqrt(area), crown, main = "Scatter plot", xlab = "Sqrt(area (cm2))", ylab = "crown (m)" )
qqnorm(residuals(sqrt_forest_mod), main = "QQplot")
plot(residuals(sqrt_forest_mod)~fitted(sqrt_forest_mod), ylab = "Residual", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)
#square root transformation of the y axis (Suitable transformation)
sqrt_forest_mod_y <- lm(sqrt(crown)~area)
plot(area, sqrt(crown), main = "Scatter plot", xlab = "area (cm2)", ylab = "sqrt(crown (m))" )
qqnorm(residuals(sqrt_forest_mod_y), main = "QQplot")
plot(residuals(sqrt_forest_mod_y)~fitted(sqrt_forest_mod_y), ylab = "Residual", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)
This is so far the best transformation. Normality shows no signs of obvious deviation, the scatter plot informs of a linear relationship between regressor and response variables and the residual vrs fitted value shows a random pattern in the distribution of the data points. This is the best transformation, because all the assumptions (Normality, constant variance are met. According to the hierarchy of assumptions, linearity and constant variance should be met at the same time. However, to be consistent below it is shown the square root transformation of both axes.
#square root transformation of both axes
sqrt_forest_mod_ <- lm(sqrt(crown)~sqrt(area))
plot(sqrt(area), sqrt(crown), main = "Scatter plot", xlab = "Sqrt(area (cm2))", ylab = "Sqrt(crown (m))" )
qqnorm(residuals(sqrt_forest_mod_), main = "QQplot")
plot(residuals(sqrt_forest_mod_)~fitted(sqrt_forest_mod_), ylab = "Residual", xlab= "Fitted Values", main= "Residuals vrs Fitted values")
abline(h=0)
A square root transformation on both x and y axes shows that the linearity of the model is compromised. Both scatter plot and residuals vs. fitted values inform of a non-linearity. Normality seems to be not violated. There seems to be a “curve”(concave up) as shown in the residuals vs. fitted values this is evidence that constant variance and linearity are not met. This transformation did not remedy the non-linearity problem that we are facing.
summary(sqrt_forest_mod_y)
##
## Call:
## lm(formula = sqrt(crown) ~ area)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.08907 -0.03992 -0.01405 0.05232 0.11087
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3034987 0.0355957 36.62 1.67e-14 ***
## area 0.0007210 0.0000183 39.40 6.50e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06415 on 13 degrees of freedom
## Multiple R-squared: 0.9917, Adjusted R-squared: 0.9911
## F-statistic: 1553 on 1 and 13 DF, p-value: 6.505e-15
predict(sqrt_forest_mod_y,newdata=data.frame(area=1900),interval="prediction")
## fit lwr upr
## 1 2.673353 2.530047 2.816658