---
title: "c07w03q02"
author: "Heiko Lange"
date: "7 10 2017"
output: html_document
header-includes:
  -
---

{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE)

Investigate the relationship between x1 and y

It has been widely noted that there is a relationship between x1 and y. We are interested in studying the magnitude of this relationship and reporting results to the president of the company. Use a linear regression model to investigate this relationship.

Question 1

Would you be confident telling the company president that there is a meaningful relationship between x1 and y?

cor(df)

##              y         x1         x2          x3
## y  1.000000000 0.88924706 0.40483873 0.006692403
## x1 0.889247057 1.00000000 0.05990831 0.017192919
## x2 0.404838730 0.05990831 1.00000000 0.041073775
## x3 0.006692403 0.01719292 0.04107377 1.000000000
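The correlation matrix gives only point estimates. A minimal sketch of how one could also attach a confidence interval and a p-value to a single correlation uses `cor.test()` from base R. The original `df` is not available here, so the data frame `sim` below is simulated stand-in data (an assumption for illustration), and its numbers will not match the table above.

```r
# Simulated stand-in data; NOT the original df from this report
set.seed(1)
n   <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 4 * sim$x1 + 2 * sim$x2 + rnorm(n, sd = 10)

# Pearson correlation test between x1 and y
ct <- cor.test(sim$x1, sim$y)

ct$estimate   # sample correlation, same value as cor(sim$x1, sim$y)
ct$conf.int   # 95% confidence interval for the correlation
ct$p.value    # p-value for H0: true correlation is zero
```

On the real data the same call would be `cor.test(df$x1, df$y)`.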

y and x1 are strongly correlated (r ≈ 0.889). A simple linear regression line shows this visually:

# Simple linear regression of y on x1
fit <- lm(y ~ x1, data = df)
# Scatter plot with the fitted line overlaid
plot(x = df$x1, y = df$y)
abline(fit, col = "red", lwd = 2)

The residuals show no obvious pattern:

plot(resid(fit))

Since x2 is also correlated with y (r ≈ 0.405), we fit a second model, y ~ x1 + x2, and show that it is superior to the first. Adding x3 does not lead to a statistically significant improvement at the 5% level (F = 3.03, p = 0.082); the residual sum of squares decreases only slightly, from 845879 to 840741. Adding x3 would not inflate the variance of the coefficient estimates by much, as all variance inflation factors are close to 1. However, x1 and x2 already explain most of the variation in y, so I choose model 2 for parsimony and use it to answer the remaining questions.

library(car)

## Warning: package 'car' was built under R version 3.4.1

# Nested models: add x2, then x3
fit2 <- lm(y ~ x1 + x2, data = df)
fit3 <- lm(y ~ x1 + x2 + x3, data = df)
# Sequential F-tests, each model against the previous one
anova(fit, fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: y ~ x1
## Model 2: y ~ x1 + x2
## Model 3: y ~ x1 + x2 + x3
##   Res.Df     RSS Df Sum of Sq        F  Pr(>F)    
## 1    498 2077455                                  
## 2    497  845879  1   1231577 726.5756 < 2e-16 ***
## 3    496  840741  1      5137   3.0307 0.08232 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
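As a check on how to read this table, the F statistics can be recomputed from the printed RSS and Res.Df columns alone. The sketch below is plain arithmetic on the rounded values shown above, so it needs neither `df` nor the fitted models. The variable names are my own.

```r
# Values copied from the anova() table above
rss    <- c(2077455, 845879, 840741)  # residual sum of squares, models 1 to 3
res_df <- c(498, 497, 496)            # residual degrees of freedom

# anova() scales every F statistic by the residual mean square of the largest model
sigma2 <- rss[3] / res_df[3]

sum_sq <- -diff(rss)     # drop in RSS when a term is added
num_df <- -diff(res_df)  # one extra parameter per step

f_stat  <- (sum_sq / num_df) / sigma2
p_value <- pf(f_stat, num_df, res_df[3], lower.tail = FALSE)

round(f_stat, 2)
signif(p_value[2], 2)
```

The results should agree with the F and Pr(>F) columns to about three significant digits; small differences come from the RSS values being rounded in the printout.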

vif(fit3)

##       x1       x2       x3 
## 1.003821 1.005220 1.001909
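For readers without the car package, the variance inflation factor of a predictor can be sketched by hand as 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The helper `vif_by_hand()` and the simulated data frame `sim` are hypothetical and only illustrate the formula; they are not the original data.

```r
# Simulated, mutually independent predictors; NOT the original df
set.seed(1)
n   <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# VIF of one predictor: regress it on all the others and use 1 / (1 - R^2)
vif_by_hand <- function(target, data) {
  others <- setdiff(names(data), target)
  r2 <- summary(lm(reformulate(others, response = target), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- sapply(names(sim), vif_by_hand, data = sim)
vifs
```

Values close to 1, as in the output above, mean that a predictor is nearly uncorrelated with the others.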

The residuals of model 1 and model 2 show no obvious patterns either:

par(mfcol = c(1, 2))
plot(resid(fit))
plot(resid(fit2))

Would you be confident telling the company president that there is a meaningful relationship between x1 and y?

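Yes, there is a meaningful relationship between x1 and y: the two are strongly correlated (r ≈ 0.889), and x1 remains highly significant after adjusting for x2.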

Question 2

Report the estimated coefficient for x1 from your model to 6 significant digits.

summary(fit2)$coefficients[2, 1]

## [1] 4.03364

Question 3

Report the 95% confidence interval for the coefficient for x1 from your model. Use 6 significant digits for both the lower and upper bounds. Report it in the format:

(Lower Bound, Upper Bound)

cat("By hand")

## By hand

# estimate ± t quantile × standard error
summary(fit2)$coefficients[2, 1] + c(-1, 1) * qt(1 - 0.05 / 2, df = df.residual(fit2)) * summary(fit2)$coefficients[2, 2]

## [1] 3.913899 4.153380

cat("via confint(...)")

## via confint(...)

confint(fit2, level = 0.95)

##                  2.5 %    97.5 %
## (Intercept) -39.262399 11.604294
## x1            3.913899  4.153380
## x2            1.972051  2.282821
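The question asks for 6 significant digits in the format (Lower Bound, Upper Bound), while R prints 7 digits by default. One possible way to produce the requested string from the x1 bounds printed above is sketched below; `formatC()` is base R, and the variable names are my own.

```r
# Bounds for x1 copied from the confint() output above
ci <- c(3.913899, 4.153380)

# format = "g" with flag = "#" keeps trailing zeros,
# so each bound is shown with exactly 6 significant digits
bounds <- formatC(ci, digits = 6, format = "g", flag = "#")

answer <- paste0("(", bounds[1], ", ", bounds[2], ")")
print(answer)
```

With the fitted model at hand, `ci` could instead be taken directly from `confint(fit2)["x1", ]`.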

Question 4

Report the p-value associated with the coefficient for x1 from your model to 6 significant digits. Use scientific notation.

format(summary(fit2)$coefficients[2, 4], scientific = TRUE, digits = 6)

## [1] "1.27157e-248"
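The p-value in the coefficient table comes from a two-sided t-test of the estimate against zero. A minimal sketch of that calculation follows. It uses simulated stand-in data (`sim` and `fit_sim` are assumptions for illustration, not the original `df` or `fit2`), so its numbers differ from the ones reported above.

```r
# Simulated stand-in data; NOT the original df
set.seed(1)
n   <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 4 * sim$x1 + 2 * sim$x2 + rnorm(n, sd = 10)

fit_sim <- lm(y ~ x1 + x2, data = sim)
coefs   <- coef(summary(fit_sim))

# t statistic = estimate / standard error
t_stat <- coefs["x1", "Estimate"] / coefs["x1", "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_hand <- 2 * pt(abs(t_stat), df = df.residual(fit_sim), lower.tail = FALSE)

format(p_hand, scientific = TRUE, digits = 6)
all.equal(p_hand, coefs["x1", "Pr(>|t|)"])
```

Using `lower.tail = FALSE` rather than `1 - pt(...)` matters for very small p-values, because the subtraction would round to zero.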