Learning Log Day 6

Before we can do any hypothesis testing, we need to create a model. For my model, I will be using the iris dataset. I will determine the effect that species and sepal width have on sepal length.

mod <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
summary(mod)

## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.30711 -0.25713 -0.05325  0.19542  1.41253 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.2514     0.3698   6.089 9.57e-09 ***
## Sepal.Width         0.8036     0.1063   7.557 4.19e-12 ***
## Speciesversicolor   1.4587     0.1121  13.012  < 2e-16 ***
## Speciesvirginica    1.9468     0.1000  19.465  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.438 on 146 degrees of freedom
## Multiple R-squared:  0.7259, Adjusted R-squared:  0.7203 
## F-statistic: 128.9 on 3 and 146 DF,  p-value: < 2.2e-16

Our summary gives us point estimates for our coefficients, and we can interpret each of them. As an example, we’ll interpret the coefficient for sepal width: for every one-unit increase in sepal width, sepal length increases by about 0.80 units on average, with species held constant.
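One way to check this interpretation is to compare fitted values for two flowers of the same species whose sepal widths differ by exactly one unit. This is only a sketch: the two versicolor flowers and the names `flowers` and `fitted_lengths` are made up for illustration and are not part of the original analysis.

```r
# Refit the model from the text so this block runs on its own
mod <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

# Two hypothetical versicolor flowers whose sepal widths differ by one unit
flowers <- data.frame(Sepal.Width = c(2.5, 3.5), Species = "versicolor")
fitted_lengths <- predict(mod, flowers)

# The gap between the two fitted values equals the Sepal.Width coefficient
diff(fitted_lengths)
coef(mod)["Sepal.Width"]
```

Because species is the same for both rows, the species term cancels and only the sepal width slope is left in the difference.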

From our summary, we can see that each of the predictor variables we used is significant in predicting the sepal length of an iris. However, we can also run a hypothesis test ourselves to determine the significance of any one coefficient. For this example, we’ll use a hypothesis test to determine the p-value for the sepal width coefficient.

Hypothesis Test for Coefficients

For our hypothesis test, our null hypothesis is that the sepal width coefficient in our model is equal to 0 and our alternative hypothesis is that the sepal width coefficient in our model is not equal to 0.

Next, we’ll need to determine the test statistic that we’ll use.

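# t statistic = coefficient estimate / its standard error
coefs <- coef(summary(mod))
tstat <- coefs["Sepal.Width", "Estimate"] / coefs["Sepal.Width", "Std. Error"]
tstat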

## [1] 7.556598

We get a test statistic of 7.557, which is the coefficient estimate divided by its standard error. Knowing this, we can obtain the p-value for our coefficient to determine whether or not we should reject our null hypothesis. To do that, we’ll need the residual degrees of freedom, which we can read from the summary we generated or calculate as total observations - number of predictors - 1. Species contributes two indicator variables, so the model has three predictors and our degrees of freedom are 150 - 3 - 1 = 146.
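The degrees of freedom can also be pulled from the fitted model instead of being counted by hand. A minimal sketch, where `n` and `p` are names chosen here for illustration:

```r
# Refit the model from the text so this block runs on its own
mod <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

n <- nobs(mod)          # number of observations used in the fit
p <- length(coef(mod))  # estimated coefficients, including the intercept

# Observations minus estimated coefficients
n - p

# R's built-in accessor for the same quantity
df.residual(mod)
```

Counting the intercept as one of the estimated coefficients is the same as subtracting the number of predictors and then 1.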

# two-sided p-value: double the upper-tail area beyond |tstat|
2 * pt(abs(tstat), df = 146, lower.tail = FALSE)

## [1] 4.18734e-12

We get a p-value of about 4.19 x 10^-12, which agrees with the output from the summary of our model. Because this p-value is far below any common significance level, we reject the null hypothesis and conclude that the coefficient for sepal width, in our model, is not 0.
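The same calculation can be repeated for every coefficient at once and set next to the p-values that summary() reports. This is a sketch rather than part of the original log; the names `tstats` and `pvals` are introduced here for illustration.

```r
# Refit the model from the text so this block runs on its own
mod <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
coefs <- coef(summary(mod))

# t statistic for each coefficient: estimate divided by standard error
tstats <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p-values using the residual degrees of freedom
pvals <- 2 * pt(abs(tstats), df.residual(mod), lower.tail = FALSE)

# Hand-computed values next to the ones reported by summary()
cbind(by_hand = pvals, from_summary = coefs[, "Pr(>|t|)"])
```

Taking abs() of the t statistic keeps the two-sided calculation correct for coefficients with negative estimates as well.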

Confidence Interval for Coefficients

Now, we can create and interpret a confidence interval for the sepal width coefficient. We could build the interval by hand, but, luckily, R’s confint() function does it for us, using a 95% confidence level by default.

confint(mod)

##                       2.5 %   97.5 %
## (Intercept)       1.5206309 2.982156
## Sepal.Width       0.5933983 1.013723
## Speciesversicolor 1.2371791 1.680307
## Speciesvirginica  1.7491525 2.144481

We see from our output that the confidence interval for sepal width is [0.5934, 1.014]. We can interpret it like this: holding species constant, we are 95% confident that a one-unit increase in sepal width is associated with an average increase in sepal length of between 0.5934 and 1.014 units.
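For comparison, here is a sketch of the by-hand version of the interval: the estimate plus or minus a t critical value times the standard error. The names `est`, `se`, `tcrit`, and `ci` are chosen here for illustration.

```r
# Refit the model from the text so this block runs on its own
mod <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
coefs <- coef(summary(mod))

est <- coefs["Sepal.Width", "Estimate"]
se  <- coefs["Sepal.Width", "Std. Error"]

# Critical value for a 95% two-sided interval on the residual df
tcrit <- qt(0.975, df.residual(mod))

# estimate +/- critical value * standard error
ci <- c(lower = est - tcrit * se, upper = est + tcrit * se)
ci

# confint() restricted to the same coefficient, for comparison
confint(mod, "Sepal.Width")
```

Using 0.975 in qt() leaves 2.5% in each tail, which matches the 2.5 % and 97.5 % column labels in the confint() output.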

Confidence Interval for Data

Finally, we want to create a confidence interval and a prediction interval for a setosa iris with a sepal width of 3 units. We’ll start out by creating and interpreting the confidence interval.

newdat <- data.frame(Sepal.Width=3, Species="setosa")
conf <- predict(mod, newdat, interval = "confidence")
conf

##        fit      lwr      upr
## 1 4.662076 4.510173 4.813979

Our output gives us a point estimate for the mean sepal length of a setosa iris with a sepal width of 3 units along with its confidence interval. Our point estimate is 4.662 and our confidence interval is [4.51, 4.814]. So, we are 95% confident that the mean sepal length of a setosa iris with a sepal width of 3 units is between 4.51 and 4.814 units.
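To see where this interval comes from, predict() can also return the standard error of the fitted mean, and the interval can be rebuilt from it. A sketch, with `p`, `fit`, `se_fit`, and `ci_mean` as illustrative names:

```r
# Refit the model and rebuild the new data so this block runs on its own
mod <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
newdat <- data.frame(Sepal.Width = 3, Species = "setosa")

# se.fit = TRUE adds the standard error of the estimated mean response
p <- predict(mod, newdat, se.fit = TRUE)
fit    <- unname(p$fit)
se_fit <- unname(p$se.fit)

# 95% interval for the mean: fit +/- critical value * se of the fit
tcrit <- qt(0.975, p$df)
ci_mean <- c(fit = fit, lwr = fit - tcrit * se_fit, upr = fit + tcrit * se_fit)
ci_mean
```

The standard error here only reflects uncertainty in the estimated mean, which is why this interval is fairly narrow.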

Prediction Interval for Data

Then, we’ll create and interpret the prediction interval for an individual setosa iris’ sepal length, given that its sepal width is 3 units.

pred <- predict(mod, newdat, interval = "prediction")
pred

##        fit      lwr      upr
## 1 4.662076 3.783294 5.540858

Here, we get the same point estimate for the individual iris, which is to be expected, but we see a wider interval because an individual flower varies around the mean, and the interval has to cover that extra variability on top of the uncertainty in the mean itself. We obtain an interval of [3.783, 5.541]. We can interpret this as such: We are 95% confident that the sepal length of an individual setosa iris with a sepal width of 3 units is between 3.783 and 5.541 units.
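The extra width can be made explicit by rebuilding the prediction interval by hand: the standard error for an individual combines the standard error of the fitted mean with the residual standard error. A sketch, with `se_pred` and `pi_hand` as illustrative names:

```r
# Refit the model and rebuild the new data so this block runs on its own
mod <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
newdat <- data.frame(Sepal.Width = 3, Species = "setosa")

p <- predict(mod, newdat, se.fit = TRUE)
fit <- unname(p$fit)

# Uncertainty in the mean plus the spread of individuals around the mean
se_pred <- sqrt(unname(p$se.fit)^2 + p$residual.scale^2)

# 95% interval for an individual flower
tcrit <- qt(0.975, p$df)
pi_hand <- c(fit = fit, lwr = fit - tcrit * se_pred, upr = fit + tcrit * se_pred)
pi_hand
```

The residual standard error (0.438 in the model summary) dominates `se_pred`, which is why the prediction interval is so much wider than the confidence interval for the mean.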

Sydney Benson

2/22/2018
