For this learning log I will be looking at the iris data in order to determine the effect that species and sepal width have on sepal length.
mod <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
summary(mod)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.30711 -0.25713 -0.05325 0.19542 1.41253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2514 0.3698 6.089 9.57e-09 ***
## Sepal.Width 0.8036 0.1063 7.557 4.19e-12 ***
## Speciesversicolor 1.4587 0.1121 13.012 < 2e-16 ***
## Speciesvirginica 1.9468 0.1000 19.465 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.438 on 146 degrees of freedom
## Multiple R-squared: 0.7259, Adjusted R-squared: 0.7203
## F-statistic: 128.9 on 3 and 146 DF, p-value: < 2.2e-16
Hypothesis test: H0: β1 = 0 versus H1: β1 ≠ 0, where β1 is the coefficient for sepal width. In context, we are testing whether sepal width has a linear effect on sepal length after accounting for species. For this test we need to check the assumptions of normality of the residuals, independence of the observations, and random sampling. I would also like to set alpha to 0.05 because this is a common choice that is conservative without being overly conservative. Now we are going to calculate the test statistic and p-value to decide whether to reject or fail to reject the null.
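Written out, the test looks like the block below. The symbol β1 for the sepal width coefficient is my own notation (an assumption, not something printed by R), and the numbers are taken from the coefficient table above; the 146 degrees of freedom come from 150 observations minus 4 estimated coefficients.

```latex
\[
\begin{aligned}
H_0 &: \beta_{1} = 0 \qquad \text{versus} \qquad H_1 : \beta_{1} \neq 0 \\[4pt]
t &= \frac{\hat{\beta}_{1}}{\operatorname{SE}(\hat{\beta}_{1})}
   = \frac{0.8036}{0.1063} \approx 7.56,
   \qquad t \sim t_{146} \ \text{under } H_0 \\[4pt]
p &= 2 \, P\left(T_{146} \ge |t|\right)
\end{aligned}
\]
```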
tstat <- coef(summary(mod))[2,1]/coef(summary(mod))[2,2]
tstat
## [1] 7.556598
2*pt(tstat, 146, lower.tail=FALSE)
## [1] 4.18734e-12
From this p-value of 4.18734e-12, which is far below alpha = 0.05, we can reject the null hypothesis and say that we found significant evidence that the coefficient for sepal width is different from zero.
Now it's on to confidence intervals!
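The normality assumption mentioned before the test can be looked at informally with the residuals. This is only a rough sketch of one way to do it, using base R functions (`qqnorm`, `qqline`, `shapiro.test`); the object names `res` and `sw` are my own, and independence and random sampling cannot be checked this way because they depend on how the data were collected.

```r
# Refit the model so this block runs on its own
mod <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
res <- residuals(mod)

# Normal Q-Q plot: points close to the line suggest roughly normal residuals
qqnorm(res)
qqline(res)

# Residuals vs fitted values: look for curvature or a funnel shape
plot(fitted(mod), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Shapiro-Wilk test as a formal (but sample-size sensitive) check of normality
sw <- shapiro.test(res)
sw
```

The plots are usually more informative than the Shapiro-Wilk p-value alone, since the test can flag small departures that do not matter much in practice.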
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 1.5206309 2.982156
## Sepal.Width 0.5933983 1.013723
## Speciesversicolor 1.2371791 1.680307
## Speciesvirginica 1.7491525 2.144481
From this we can say that we are 95% confident that the true population coefficient for sepal width is between 0.5933983 and 1.013723. This is important because the interval does not contain zero, which agrees with the hypothesis test above: at the 5% level, zero is not a plausible value for the coefficient.
Next we move on to confidence intervals and prediction intervals for a new observation.
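To see where the interval for sepal width comes from, it can be rebuilt by hand as estimate ± critical value × standard error. This is a small sketch under the same model; the names `est`, `se`, `tcrit` and `manual_ci` are just ones I picked for illustration.

```r
# Refit the model so this block runs on its own
mod <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

# Estimate and standard error for the sepal width coefficient
est <- coef(summary(mod))["Sepal.Width", "Estimate"]
se  <- coef(summary(mod))["Sepal.Width", "Std. Error"]

# Critical value from the t distribution with the residual degrees of freedom
tcrit <- qt(0.975, df = df.residual(mod))

# 95% interval: estimate plus or minus critical value times standard error
manual_ci <- c(lower = est - tcrit * se, upper = est + tcrit * se)
manual_ci

# Compare with the row that confint() reports for the same coefficient
all.equal(unname(manual_ci), unname(confint(mod)["Sepal.Width", ]))
```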
newdata <- data.frame(Sepal.Width=2, Species="virginica")
confint <- predict(mod, newdata, interval = "confidence")
confint
## fit lwr upr
## 1 5.805332 5.566826 6.043838
We are 95% confident that the mean sepal length of a virginica iris with a sepal width of 2 units is between 5.566826 and 6.043838 units.
newdata <- data.frame(Sepal.Width=2, Species="virginica")
predint <- predict(mod, newdata, interval = "prediction")
predint
## fit lwr upr
## 1 5.805332 4.907519 6.703145
We are 95% confident that the sepal length of an individual virginica iris with a sepal width of 2 units is between 4.907519 and 6.703145 units. The prediction interval accounts for the variability of a single observation around the mean in addition to the uncertainty in the estimated mean, so it makes sense that it is wider than the confidence interval. It is also important to notice that the point estimate is the same for both.
confint[1] == predint[1]
## [1] TRUE
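The difference in width between the two intervals can be made concrete by building both by hand from the pieces that `predict()` returns when `se.fit = TRUE`. This is a sketch, not the only way to do it; the names `p`, `fit`, `ci_manual` and `pi_manual` are my own. The confidence interval uses only the standard error of the fitted mean, while the prediction interval adds the residual variance under the square root.

```r
# Refit the model and rebuild the new observation so this block runs on its own
mod <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
newdata <- data.frame(Sepal.Width = 2, Species = "virginica")

# se.fit = TRUE also returns the standard error of the fitted mean,
# the residual degrees of freedom, and the residual standard error
p <- predict(mod, newdata, se.fit = TRUE)
fit <- unname(p$fit)
tcrit <- qt(0.975, df = p$df)

# Confidence interval: uncertainty in the estimated mean only
ci_manual <- fit + c(-1, 1) * tcrit * p$se.fit

# Prediction interval: uncertainty in the mean plus residual variance
pi_manual <- fit + c(-1, 1) * tcrit * sqrt(p$se.fit^2 + p$residual.scale^2)

ci_manual
pi_manual
```

Both intervals are centered on the same fitted value, which is why the point estimates match; only the amount added and subtracted changes.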