Q1.
##################QUESTION 1############################
Fit MLR model
Dat <- read.csv("C:/Users/eclai/Downloads/chemical.csv")
mlr <- lm(y ~ x6 + x7, data = Dat)
summary(mlr)
##
## Call:
## lm(formula = y ~ x6 + x7, data = Dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.2035 -4.3713 0.2513 4.9339 21.9682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.526460 3.610055 0.700 0.4908
## x6 0.018522 0.002747 6.742 5.66e-07 ***
## x7 2.185753 0.972696 2.247 0.0341 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.924 on 24 degrees of freedom
## Multiple R-squared: 0.6996, Adjusted R-squared: 0.6746
## F-statistic: 27.95 on 2 and 24 DF, p-value: 5.391e-07
Equation of the fitted model: hat(y) = 2.526460 + 0.018522(x6) + 2.185753(x7)
##################QUESTION 2############################
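The assumptions we need to make for linear regression are:
1. The relationship between the response and the regressors is linear.
2. The random errors have zero mean.
3. The random errors have constant variance.
4. The random errors are independent.
5. The random errors are normally distributed.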
##################QUESTION 3############################
Constructing scatter plots of the response variable against each regressor, we see that they do not support the assumption of linearity.
dat <- subset(Dat, select = c(1,7,8))
pairs(dat)
##################QUESTION 4############################
Plotting the residual errors against the fitted values can be used to check for zero mean and constant variance of random error.
plot(mlr$fitted.values, mlr$residuals,
main = "Check for 0 mean and constant var \n Residual vs. fitted value")
abline(h=0)
Here, we can say the errors have approximately zero mean (unbiased): when dividing the plot into thin vertical strips, the residuals average about zero in almost every strip. We also have constant variance (homoscedasticity), as the spread of the residuals is about the same in every thin vertical strip. There is also no clear pattern, which is what we like to see.
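The thin-vertical-strip reading of the residual plot can also be tabulated. The following is a minimal sketch on simulated stand-in data, not the chemical data set: the coefficients, the sample size, and the names fit, strip, strip_mean and strip_sd are all assumptions made for illustration. It groups the fitted values into five equal-count strips and reports the mean and standard deviation of the residuals in each.

```r
# Simulated stand-in data (hypothetical; NOT the chemical data set).
set.seed(1)
n  <- 200
x6 <- runif(n, 0, 2000)
x7 <- runif(n, 0, 10)
y  <- 2 + 0.02 * x6 + 2 * x7 + rnorm(n, sd = 10)
fit <- lm(y ~ x6 + x7)

# Group the fitted values into 5 strips holding equal numbers of points.
strip <- cut(fitted(fit),
             breaks = quantile(fitted(fit), probs = 0:5 / 5),
             include.lowest = TRUE)

# Mean and standard deviation of the residuals within each strip.
strip_mean <- tapply(resid(fit), strip, mean)
strip_sd   <- tapply(resid(fit), strip, sd)
print(round(cbind(mean = strip_mean, sd = strip_sd), 2))
```

Strip means near zero and roughly equal strip standard deviations correspond to what the plot is meant to show. With only a few dozen observations, fewer strips would likely be needed.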
##################QUESTION 5############################
Plotting the residual errors against the row numbers, with the rows sorted by each regressor variable, helps us check the assumption of independence of error. Here there is no obvious pattern in either plot, so the assumption of independence of error is not violated.
row_num <- c(1:nrow(dat))
sort_x1 <- sort(dat$x6, index.return=TRUE)
plot(row_num, mlr$residuals[sort_x1$ix],
main = "Check for independence \n Residuals sorted by x6")
abline(h=0)
row_num <- c(1:nrow(dat))
sort_x1 <- sort(dat$x7, index.return=TRUE)
plot(row_num, mlr$residuals[sort_x1$ix],
main = "Check for independence \n Residuals sorted by x7")
abline(h=0)
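As a numeric companion to the sorted-residual plots, one could look at the correlation between neighbouring residuals after sorting. This is a sketch on simulated stand-in data, and the lag-1 correlation is an addition for illustration rather than part of the original analysis; the names fit, idx, res_sorted and lag1 are hypothetical. It also shows that order() gives the same permutation as sort(..., index.return=TRUE)$ix when there are no ties.

```r
# Simulated stand-in data (hypothetical; NOT the chemical data set).
set.seed(2)
n  <- 100
x6 <- runif(n, 0, 2000)
y  <- 2 + 0.02 * x6 + rnorm(n, sd = 10)
fit <- lm(y ~ x6)

# Row order that sorts the regressor from smallest to largest.
idx <- order(x6)
res_sorted <- resid(fit)[idx]

# Correlation between each sorted residual and the one that follows it.
lag1 <- cor(res_sorted[-n], res_sorted[-1])
print(lag1)
```

A value close to zero is consistent with no pattern in the plot; a value far from zero would suggest neighbouring residuals move together.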
##################QUESTION 6############################
A normal probability plot can be used to check for normality of random error
qqnorm(mlr$residuals)
qqline(mlr$residuals)
The normal probability plot shows the points following the line closely through the middle of the plot, but deviating slightly at the beginning and towards the end. The condition for normality of error is therefore nearly satisfied; because the deviations at the ends suggest a slight skew, we reject it on the basis of this plot alone.
##################QUESTION 7############################
shapiro.test(mlr$residuals)
##
## Shapiro-Wilk normality test
##
## data: mlr$residuals
## W = 0.97358, p-value = 0.6981
p-value = 0.6981, which is larger than any conventional significance level. We fail to reject H0 (that the errors are normally distributed), so the data show no significant evidence against normality of the errors.
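To illustrate how the Shapiro-Wilk p-value is read, here is a small sketch on simulated errors. The samples and the names normal_err, skewed_err, p_normal and p_skewed are assumptions for illustration only. H0 is that the sample comes from a normal distribution, so a small p-value is evidence against normality, while a large one only means that no such evidence was found.

```r
# Simulated errors (hypothetical; not residuals from the chemical model).
set.seed(3)
normal_err <- rnorm(200)      # drawn from a normal distribution
skewed_err <- rexp(200) - 1   # centred but strongly right-skewed

# Extract the p-value of each Shapiro-Wilk test.
p_normal <- shapiro.test(normal_err)$p.value
p_skewed <- shapiro.test(skewed_err)$p.value
print(c(normal = p_normal, skewed = p_skewed))
```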
Q2.
##################QUESTION 1############################
Fit SLR model
Dat1 <- read.csv("C:/Users/eclai/Downloads/windmill.csv")
slr <- lm(y ~ x, data = Dat1)
summary(slr)
##
## Call:
## lm(formula = y ~ x, data = Dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59869 -0.14099 0.06059 0.17262 0.32184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.13088 0.12599 1.039 0.31
## x 0.24115 0.01905 12.659 7.55e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2361 on 23 degrees of freedom
## Multiple R-squared: 0.8745, Adjusted R-squared: 0.869
## F-statistic: 160.3 on 1 and 23 DF, p-value: 7.546e-12
equation of fitted line: hat(y) = 0.13088 + 0.24115(x)
##################QUESTION 2############################
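The assumptions we need to make for linear regression are:
1. The relationship between the response and the regressor is linear.
2. The random errors have zero mean.
3. The random errors have constant variance.
4. The random errors are independent.
5. The random errors are normally distributed.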
##################QUESTION 3############################
Constructing a scatter plot of the response variable against the regressor, we clearly see that the relationship is slightly curved rather than linear. This means the plot does not support the assumption of linearity.
pairs(Dat1)
Plotting the residual errors against the fitted values, we see that we do not have a zero mean. When dividing the plot into thin vertical strips, the residuals clearly do not average zero in almost any strip. We also do not have constant variance, as the spread of the residuals is not the same from strip to strip. There is also a clear pattern, a curved line, which is not ideal.
plot(slr$fitted.values, slr$residuals,
main = "Check for 0 mean and constant var \n Residual vs. fitted value")
abline(h=0)
Plotting the residual errors against the row numbers, with the rows sorted by the regressor variable, helps us check the assumption of independence of error. Here we clearly see that the residuals follow a concave-down curve. This violates the assumption of independence of error.
row_num <- c(1:nrow(Dat1))
sort_x1 <- sort(Dat1$x, index.return=TRUE)
plot(row_num, slr$residuals[sort_x1$ix],
main = "Check for independence \n Residuals sorted by x")
abline(h=0)
The normal probability plot shows that we nearly have normality of random error, but not quite. Towards the beginning and end of the plot, the points do not exactly follow the line. The shape of the graph indicates a heavy-tailed distribution.
qqnorm(slr$residuals)
qqline(slr$residuals)
shapiro.test(slr$residuals)
##
## Shapiro-Wilk normality test
##
## data: slr$residuals
## W = 0.93587, p-value = 0.1188
Using the Shapiro-Wilk normality test, p-value = 0.1188. We fail to reject H0 (that the errors are normally distributed), so the data show no significant evidence against normality of the errors.
##################QUESTION 4############################
Our transformation will be squaring the y variable
slr1 <- lm((y^2) ~ x, data = Dat1)
summary(slr1)
##
## Call:
## lm(formula = (y^2) ~ x, data = Dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.74840 -0.31027 0.05951 0.30793 0.57072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.35851 0.21239 -6.396 1.58e-06 ***
## x 0.71066 0.03211 22.130 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3979 on 23 degrees of freedom
## Multiple R-squared: 0.9551, Adjusted R-squared: 0.9532
## F-statistic: 489.7 on 1 and 23 DF, p-value: < 2.2e-16
We will now verify the assumptions
pairs((y^2) ~ x, data=Dat1)
plot(Dat1$x, (Dat1$y)^2,
main = "Check for linearity \n y^2 vs. x")
abline(h=0)
plot(slr1$fitted.values, slr1$residuals,
main = "Check for 0 mean and constant var \n Residual vs. fitted value")
abline(h=0)
row_num <- c(1:nrow(Dat1))
sort_x1 <- sort(Dat1$x, index.return=TRUE)
plot(row_num, slr1$residuals[sort_x1$ix],
main = "Check for independence \n Residuals sorted by x")
abline(h=0)
qqnorm(slr1$residuals)
qqline(slr1$residuals)
shapiro.test(slr1$residuals)
##
## Shapiro-Wilk normality test
##
## data: slr1$residuals
## W = 0.94769, p-value = 0.2223
Constructing scatter plots for our new model, we clearly see that the relationship is linear, which supports the assumption of linearity.
Plotting the residual errors against the fitted values supports the assumptions of zero mean and constant variance. When dividing the plot into thin vertical strips, the residuals average about zero in almost every strip, and their spread is about the same in every strip. There is also no clear pattern, which is ideal.
Plotting the residual errors against the row numbers, with the rows sorted by the regressor variable, checks the assumption of independence of error. The residuals do not show any particular pattern, which supports this assumption.
The normal probability plot supports the assumption of normality of random error, since the points follow the line closely.
Lastly, using the Shapiro-Wilk normality test, p-value = 0.2223. We fail to reject H0, so the data show no significant evidence against normality of the errors.
##################QUESTION 5############################
Estimate y at x = 5, taking square roots to return from the y^2 scale to the y scale.
x0 <- data.frame(x=5)
predict(slr1, x0, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 2.194792 1.351937 3.037647
sqrt(1.351937)
## [1] 1.162728
sqrt(3.037647)
## [1] 1.742885
The DC_output at wind_velocity = 5 mph is predicted to lie in the 95% prediction interval (sqrt(1.351937), sqrt(3.037647)), which is (1.162728, 1.742885). The square roots convert the interval for y^2 back to the scale of y.
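The back-transformation can be done in one step by applying sqrt() to the whole matrix returned by predict(). The sketch below uses simulated stand-in data, not the windmill data set; the data-generating values and the names sim, fit_sq, pi_sq and pi_y are assumptions for illustration. Because the square root is increasing on non-negative numbers, it maps the interval endpoints for y^2 to interval endpoints for y, provided the lower bound is not negative.

```r
# Simulated stand-in data (hypothetical; NOT the windmill data set).
set.seed(4)
x  <- runif(25, 3, 10)
y2 <- pmax(-1.3 + 0.7 * x + rnorm(25, sd = 0.3), 0.01)  # keep y^2 positive
sim <- data.frame(x = x, y = sqrt(y2))

# Model fitted on the squared response, as in the transformed model above.
fit_sq <- lm(I(y^2) ~ x, data = sim)

# 95% prediction interval on the y^2 scale at x = 5.
pi_sq <- predict(fit_sq, data.frame(x = 5),
                 interval = "prediction", level = 0.95)

# Back-transform fit, lwr and upr to the y scale in one call.
pi_y <- sqrt(pi_sq)
print(pi_y)
```

If the lower bound on the y^2 scale were negative, sqrt() would return NaN for it, so the bound should be checked before transforming.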
#######ALL OF THE R CODE USED IN ORDER##################
Q1.
Dat <- chemical
mlr <- lm(y ~ x6 + x7, data = Dat)
summary(mlr)
dat <- subset(Dat, select = c(1,7,8))
pairs(dat)
plot(mlr$fitted.values, mlr$residuals,
main = "Check for 0 mean and constant var \n Residual vs. fitted value")
abline(h=0)
row_num <- c(1:nrow(dat))
sort_x1 <- sort(dat$x6, index.return=TRUE)
plot(row_num, mlr$residuals[sort_x1$ix],
main = "Check for independence \n Residuals sorted by x6")
abline(h=0)
row_num <- c(1:nrow(dat))
sort_x1 <- sort(dat$x7, index.return=TRUE)
plot(row_num, mlr$residuals[sort_x1$ix],
main = "Check for independence \n Residuals sorted by x7")
abline(h=0)
qqnorm(mlr$residuals)
qqline(mlr$residuals)
shapiro.test(mlr$residuals)
Q2.
Dat1 <- windmill
slr <- lm(y ~ x, data = Dat1)
summary(slr)
pairs(Dat1)
plot(slr$fitted.values, slr$residuals,
main = "Check for 0 mean and constant var \n Residual vs. fitted value")
abline(h=0)
row_num <- c(1:nrow(Dat1))
sort_x1 <- sort(Dat1$x, index.return=TRUE)
plot(row_num, slr$residuals[sort_x1$ix],
main = "Check for independence \n Residuals sorted by x")
abline(h=0)
qqnorm(slr$residuals)
qqline(slr$residuals)
shapiro.test(slr$residuals)
slr1 <- lm((y^2) ~ x, data = Dat1)
summary(slr1)
pairs((y^2) ~ x, data=Dat1)
plot(Dat1$x, (Dat1$y)^2,
main = "Check for linearity \n y^2 vs. x")
abline(h=0)
plot(slr1$fitted.values, slr1$residuals,
main = "Check for 0 mean and constant var \n Residual vs. fitted value")
abline(h=0)
row_num <- c(1:nrow(Dat1))
sort_x1 <- sort(Dat1$x, index.return=TRUE)
plot(row_num, slr1$residuals[sort_x1$ix],
main = "Check for independence \n Residuals sorted by x")
abline(h=0)
qqnorm(slr1$residuals)
qqline(slr1$residuals)
shapiro.test(slr1$residuals)