Homework 4

Q1.

##################QUESTION 1############################

Fit MLR model

Dat <- read.csv("C:/Users/eclai/Downloads/chemical.csv")
mlr <- lm(y ~ x6 + x7, data = Dat)
summary(mlr)

## 
## Call:
## lm(formula = y ~ x6 + x7, data = Dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.2035  -4.3713   0.2513   4.9339  21.9682 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.526460   3.610055   0.700   0.4908    
## x6          0.018522   0.002747   6.742 5.66e-07 ***
## x7          2.185753   0.972696   2.247   0.0341 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.924 on 24 degrees of freedom
## Multiple R-squared:  0.6996, Adjusted R-squared:  0.6746 
## F-statistic: 27.95 on 2 and 24 DF,  p-value: 5.391e-07

Equation of fitted line: hat(y) = 2.526460 + 0.018522(x6) + 2.185753(x7)
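As a quick illustration of how the fitted equation is used, the sketch below plugs regressor values into it. The coefficients are copied from the summary output above; the helper name `fitted_y` and the inputs x6 = 1000, x7 = 5 are hypothetical and are not taken from the chemical data.

```r
# Coefficients copied from the summary(mlr) output above
b <- c(intercept = 2.526460, x6 = 0.018522, x7 = 2.185753)

# Hypothetical helper: fitted response for given x6 and x7
fitted_y <- function(x6, x7) {
  unname(b["intercept"] + b["x6"] * x6 + b["x7"] * x7)
}

# Hypothetical regressor values, for illustration only
y_hat <- fitted_y(x6 = 1000, x7 = 5)
print(y_hat)
```

With the model object available, `predict(mlr, data.frame(x6 = 1000, x7 = 5))` should give the same value up to the rounding of the printed coefficients.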

##################QUESTION 2############################

The assumptions we need to make for linear regression are:

The relationship between response y and the regressors is linear, at least approximately
The random error term ε has zero mean and constant variance σ²
The errors are uncorrelated
The errors are normally distributed
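The four assumptions above can be summarized in one model statement. The symbols β0, β1, β2, ε_i and n are notation assumed here for illustration; they do not come from the assignment. For normally distributed errors, uncorrelated is the same as independent, which is why the errors are written as iid.

```latex
y_i = \beta_0 + \beta_1 x_{6,i} + \beta_2 x_{7,i} + \varepsilon_i,
\qquad
\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2),
\quad i = 1, \dots, n
```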

##################QUESTION 3############################

Constructing scatter plots of the response variable against each regressor, we see that they do not support the assumption of linearity.

dat <- subset(Dat, select = c(1,7,8))
pairs(dat)

##################QUESTION 4############################

Plotting the residual errors against the fitted values can be used to check for zero mean and constant variance of random error.

plot(mlr$fitted.values, mlr$residuals, 
     main = "Check for 0 mean and constant var \n  Residual vs. fitted value")
abline(h=0)

Here, we can say we approximately have a zero mean (unbiased): when dividing the plot into thin vertical strips, the average residual is close to zero in almost every strip. We also have constant variance (homoscedastic), as the spread of the residuals is about the same in every thin vertical strip. There is also no clear pattern, which is what we like to see.
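The "thin vertical strips" check can also be done numerically. The following is a minimal sketch on simulated stand-in data (not the chemical data); the number of strips (5) and all variable names are assumptions made for illustration.

```r
# Simulated stand-in data (hypothetical; NOT the chemical data set)
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x)

# Cut the fitted values into 5 vertical strips with equal counts
strips <- cut(fitted(fit),
              breaks = quantile(fitted(fit), probs = seq(0, 1, by = 0.2)),
              include.lowest = TRUE)

# Mean and standard deviation of the residuals inside each strip
strip_mean <- tapply(resid(fit), strips, mean)
strip_sd   <- tapply(resid(fit), strips, sd)
print(round(cbind(mean = strip_mean, sd = strip_sd), 3))
```

Strip means near zero and similar strip standard deviations are consistent with the zero-mean and constant-variance assumptions. The same code could be applied to `mlr` by replacing `fit`.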

##################QUESTION 5############################

Plotting the residual errors against the row numbers sorted by each regressor variable helps us check the assumption of independence of error. Here we can see that there is no obvious pattern in either plot, so the assumption of independence of error does not appear to be violated.

row_num <- c(1:nrow(dat))
sort_x1 <- sort(dat$x6, index.return=TRUE)
plot(row_num, mlr$residuals[sort_x1$ix],
main = "Check for independence \n Residuals sorted by x6")
abline(h=0)

row_num <- c(1:nrow(dat))
sort_x1 <- sort(dat$x7, index.return=TRUE)
plot(row_num, mlr$residuals[sort_x1$ix],
main = "Check for independence \n Residuals sorted by x7")
abline(h=0)
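A numerical companion to these plots is the correlation between neighbouring residuals after sorting by a regressor. This is only a sketch on simulated stand-in data, with hypothetical variable names; it is not one of the checks required by the assignment.

```r
# Simulated stand-in data (hypothetical; NOT the chemical data set)
set.seed(2)
n <- 100
x <- runif(n, 0, 10)
y <- 3 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

# Reorder the residuals by increasing x, as in the plots above
res_sorted <- resid(fit)[order(x)]

# Lag-1 correlation: each residual paired with its neighbour in the ordering
lag1 <- cor(res_sorted[-1], res_sorted[-n])
print(lag1)
```

A value near zero agrees with a plot that shows no pattern; a value far from zero suggests that neighbouring residuals move together.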

##################QUESTION 6############################

A normal probability plot can be used to check for normality of random error

qqnorm(mlr$residuals)
qqline(mlr$residuals)

We see that our normal probability plot shows the points following the line closely through the middle of the plot. However, at the beginning and towards the end of the plot, the points deviate slightly from the line. This means the condition for normality of error is nearly satisfied, but the residuals appear slightly skewed, so based on the plot alone we reject it.

##################QUESTION 7############################

shapiro.test(mlr$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  mlr$residuals
## W = 0.97358, p-value = 0.6981

The p-value = 0.6981 is well above the usual 0.05 significance level, so we fail to reject H0. The test does not provide evidence against normality of the errors, so the normality assumption is reasonable.
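To illustrate how the Shapiro-Wilk p-value is read, the sketch below runs the test on two simulated samples, one normal and one strongly right-skewed. The samples and variable names are hypothetical and unrelated to the homework data.

```r
# Simulated samples (hypothetical; for illustration only)
set.seed(3)
normal_sample <- rnorm(200)   # drawn from a normal distribution
skewed_sample <- rexp(200)    # strongly right-skewed

sw_normal <- shapiro.test(normal_sample)
sw_skewed <- shapiro.test(skewed_sample)

# H0: the sample comes from a normal distribution.
# A small p-value is evidence against H0; a large one is only a lack of evidence.
print(c(normal = sw_normal$p.value, skewed = sw_skewed$p.value))
```

The skewed sample should give a very small p-value, while a large p-value only means the test found no evidence against normality.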

Q2.

##################QUESTION 1############################

Fit SLR model

Dat1 <- read.csv("C:/Users/eclai/Downloads/windmill.csv")
slr <- lm(y ~ x, data = Dat1)
summary(slr)

## 
## Call:
## lm(formula = y ~ x, data = Dat1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59869 -0.14099  0.06059  0.17262  0.32184 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.13088    0.12599   1.039     0.31    
## x            0.24115    0.01905  12.659 7.55e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2361 on 23 degrees of freedom
## Multiple R-squared:  0.8745, Adjusted R-squared:  0.869 
## F-statistic: 160.3 on 1 and 23 DF,  p-value: 7.546e-12

equation of fitted line: hat(y) = 0.13088 + 0.24115(x)

##################QUESTION 2############################

The assumptions we need to make for linear regression are:

The relationship between response y and the regressors is linear, at least approximately
The random error term ε has zero mean and constant variance σ²
The errors are uncorrelated
The errors are normally distributed

##################QUESTION 3############################

Constructing a scatter plot of the response variable against the regressor, we see that the relationship is slightly curved rather than linear. This means the plot does not support the assumption of linearity.

pairs(Dat1)

Plotting the residual errors against the fitted values, we see that we do not have a zero mean. When dividing the plot into thin vertical strips, the average residual is clearly not zero in most strips. We also do not have constant variance, as the spread of the residuals is not the same from strip to strip. There is also a clear curved pattern, which is not ideal.

plot(slr$fitted.values, slr$residuals, 
     main = "Check for 0 mean and constant var \n  Residual vs. fitted value")
abline(h=0)

Plotting the residual errors against the row numbers sorted by regressor variables helps us check for the assumption of independence of error. Here we clearly see that the residuals show a pattern in the form of a concave down line. This violates the assumption of independence of error.

row_num <- c(1:nrow(Dat1))
sort_x1 <- sort(Dat1$x, index.return=TRUE)
plot(row_num, slr$residuals[sort_x1$ix],
main = "Check for independence \n Residuals sorted by x")
abline(h=0)

The normal probability plot shows us that we nearly have normality of random error but not quite. Towards the beginning and end of the plot, the points do not exactly follow the line. The shape of the graph indicates that we have a heavy-tailed distribution.

qqnorm(slr$residuals)
qqline(slr$residuals)

shapiro.test(slr$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  slr$residuals
## W = 0.93587, p-value = 0.1188

Using the Shapiro-Wilk normality test, the p-value = 0.1188 is above the usual 0.05 significance level, so we fail to reject H0. The test does not provide evidence against normality of the errors.

##################QUESTION 4############################

Our transformation will be squaring the y variable

slr1 <- lm((y^2) ~ x, data = Dat1)
summary(slr1)

## 
## Call:
## lm(formula = (y^2) ~ x, data = Dat1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74840 -0.31027  0.05951  0.30793  0.57072 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.35851    0.21239  -6.396 1.58e-06 ***
## x            0.71066    0.03211  22.130  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3979 on 23 degrees of freedom
## Multiple R-squared:  0.9551, Adjusted R-squared:  0.9532 
## F-statistic: 489.7 on 1 and 23 DF,  p-value: < 2.2e-16

We will now verify the assumptions

pairs((y^2) ~ x, data=Dat1)

plot(Dat1$x, (Dat1$y)^2, 
     main = "Check for linearity \n y^2 vs. x")
abline(h=0)

plot(slr1$fitted.values, slr1$residuals, 
     main = "Check for 0 mean and constant var \n  Residual vs. fitted value")
abline(h=0)

row_num <- c(1:nrow(Dat1))
sort_x1 <- sort(Dat1$x, index.return=TRUE)
plot(row_num, slr1$residuals[sort_x1$ix],
main = "Check for independence \n Residuals sorted by x")
abline(h=0)

qqnorm(slr1$residuals)
qqline(slr1$residuals)

shapiro.test(slr1$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  slr1$residuals
## W = 0.94769, p-value = 0.2223

Constructing scatter plots for our new model, we see that the relationship between y^2 and x is linear. This verifies the assumption of linearity.

Plotting the residual errors against the fitted values, we can verify the assumption of zero mean and constant variance. When dividing the plot into thin vertical strips, the average residual is close to zero in almost every strip. We also have constant variance, as the spread of the residuals is about the same in every thin vertical strip. There is also no clear pattern, which is ideal.

Plotting the residual errors against the row numbers sorted by the regressor variable helps us check the assumption of independence of error. Here the residuals do not show any particular pattern, which supports the assumption of independence of error.

The normal probability plot supports the assumption of normality of random error, since the points follow the line closely. Lastly, using the Shapiro-Wilk normality test, the p-value = 0.2223 is above the usual 0.05 significance level, so we fail to reject H0: the test does not provide evidence against normality of the errors.

##################QUESTION 5############################

x=5, estimate y

x0 <- data.frame(x=5)
predict(slr1, x0, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 2.194792 1.351937 3.037647

sqrt(1.351937)

## [1] 1.162728

sqrt(3.037647)

## [1] 1.742885

The model is fitted to y^2, so the interval above is on the y^2 scale. Taking square roots of its endpoints gives a 95% prediction interval for the DC_output at wind_velocity = 5 mph: (sqrt(1.351937), sqrt(3.037647)), which is (1.162728, 1.742885).
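The back-transformation can be done in one step on the whole predict() row. In this sketch the three numbers are copied from the output above; the names `pi_y2` and `pi_y` are hypothetical.

```r
# Interval for y^2 at x = 5, copied from the predict() output above
pi_y2 <- c(fit = 2.194792, lwr = 1.351937, upr = 3.037647)

# sqrt() is increasing on non-negative values, so endpoints map to endpoints;
# this requires all three values to be non-negative
stopifnot(all(pi_y2 >= 0))
pi_y <- sqrt(pi_y2)
print(round(pi_y, 4))
```

With the model object available, `sqrt(predict(slr1, x0, interval = "prediction", level = 0.95))` should return the same three values on the original scale.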

#######ALL OF THE R CODE USED IN ORDER##################

Q1.

Dat <- chemical
mlr <- lm(y ~ x6 + x7, data = Dat)
summary(mlr)

dat <- subset(Dat, select = c(1,7,8))
pairs(dat)

plot(mlr$fitted.values, mlr$residuals,
     main = "Check for 0 mean and constant var \n  Residual vs. fitted value")
abline(h=0)

row_num <- c(1:nrow(dat))
sort_x1 <- sort(dat$x6, index.return=TRUE)
plot(row_num, mlr$residuals[sort_x1$ix],
     main = "Check for independence \n Residuals sorted by x6")
abline(h=0)

row_num <- c(1:nrow(dat))
sort_x1 <- sort(dat$x7, index.return=TRUE)
plot(row_num, mlr$residuals[sort_x1$ix],
     main = "Check for independence \n Residuals sorted by x7")
abline(h=0)

qqnorm(mlr$residuals)
qqline(mlr$residuals)
shapiro.test(mlr$residuals)

Q2.

Dat1 <- windmill
slr <- lm(y ~ x, data = Dat1)
summary(slr)

pairs(Dat1)

plot(slr$fitted.values, slr$residuals,
     main = "Check for 0 mean and constant var \n  Residual vs. fitted value")
abline(h=0)

row_num <- c(1:nrow(Dat1))
sort_x1 <- sort(Dat1$x, index.return=TRUE)
plot(row_num, slr$residuals[sort_x1$ix],
     main = "Check for independence \n Residuals sorted by x")
abline(h=0)

qqnorm(slr$residuals)
qqline(slr$residuals)

shapiro.test(slr$residuals)

slr1 <- lm((y^2) ~ x, data = Dat1)
summary(slr1)

pairs((y^2) ~ x, data=Dat1)

plot(Dat1$x, (Dat1$y)^2,
     main = "Check for linearity \n y^2 vs. x")
abline(h=0)

plot(slr1$fitted.values, slr1$residuals,
     main = "Check for 0 mean and constant var \n  Residual vs. fitted value")
abline(h=0)

row_num <- c(1:nrow(Dat1))
sort_x1 <- sort(Dat1$x, index.return=TRUE)
plot(row_num, slr1$residuals[sort_x1$ix],
     main = "Check for independence \n Residuals sorted by x")
abline(h=0)

qqnorm(slr1$residuals)
qqline(slr1$residuals)

shapiro.test(slr1$residuals)

Emily Weiland

2023-10-28