Learning log Day 4

Hypothesis Testing and Linear Regression

We already know that R can easily fit a linear regression model to a data set and provide a detailed summary of it, as shown below.

data(women)
attach(women)
MDL <- lm(weight ~ height)
summary(MDL)

## 
## Call:
## lm(formula = weight ~ height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

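The data set contains the heights (in inches) and weights (in pounds) of 15 women. From the output, we get an intercept of -87.51667 and a slope of 3.45. Our regression equation is \[\widehat{\text{weight}} = 3.45 \times \text{height} - 87.51667\] Note that only the response carries a hat, because it is the predicted value; height is an observed input.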

We can perform a two-sided hypothesis test on the slope of this model. The null hypothesis is that the slope is equal to zero, and the alternative hypothesis is that the slope is not equal to zero. The test statistic for each coefficient is found under “t value” in the model summary. It is the estimate divided by its standard error, which for the slope is 3.45 / 0.09114 = 37.85.

The next column, “Pr(>|t|)”, lists the p-values for the two-sided tests. In this case, the p-value for the slope (1.09e-14) is much smaller than any conventional significance level, such as 0.05 or 0.01, and thus we reject the null hypothesis. There is strong evidence to suggest that the slope of the regression line is not zero.
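The t statistic and p-value described above can be reproduced by hand from the coefficient table. This is a minimal sketch, assuming the same model is refit under the illustrative name `fit`; the other variable names are also chosen here for illustration and are not part of the original log.

```r
# Refit the same model as MDL, passing the data explicitly (assumed name: fit)
fit <- lm(weight ~ height, data = women)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

est <- coefs["height", "Estimate"]
se  <- coefs["height", "Std. Error"]

# Test statistic for H0: slope = 0
t_stat <- est / se

# Residual degrees of freedom (n - 2 in simple linear regression)
resid_df <- df.residual(fit)

# Two-sided p-value from the t distribution
p_val <- 2 * pt(abs(t_stat), df = resid_df, lower.tail = FALSE)

c(t = t_stat, df = resid_df, p = p_val)
```

The values should agree with the “t value” and “Pr(>|t|)” entries in the height row of the summary.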

Now let’s think about why this is helpful. The slope of a regression line tells us the expected change in the response variable for an increase of 1 in the explanatory variable. In this case, the slope of 3.45 means that for each additional inch of height, we expect weight to increase by 3.45 pounds. If the slope were zero, there would be no linear relationship between height and weight. The hypothesis test allows us to determine whether the slope is significantly different from zero. If it is not, then there is not enough evidence to suggest a linear relationship between the explanatory and response variables.

In our test, since we rejected the null hypothesis, there is strong evidence of a linear relationship between the height and weight of women.
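The interpretation of the slope as the expected change in weight per inch of height can be checked directly. The sketch below assumes the model is refit as `fit` and uses two hypothetical heights, 65 and 66 inches, chosen only for illustration.

```r
# Refit the same model as MDL (assumed name: fit)
fit <- lm(weight ~ height, data = women)

# Two hypothetical heights that differ by exactly one inch
new_heights <- data.frame(height = c(65, 66))

# Predicted weights at those heights
pred <- predict(fit, newdata = new_heights)

# The difference between the two predictions ...
diff(pred)

# ... should match the fitted slope
coef(fit)["height"]
```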

Confidence Intervals for Slope and Intercept

We can also use R to create confidence intervals for the slope and intercept of our model, as shown below.

int95 <- confint(MDL, level = 0.95)
int95

##                   2.5 %     97.5 %
## (Intercept) -100.342655 -74.690679
## height         3.253112   3.646888

The output gives the lower and upper bounds for both the intercept and the slope. To create intervals at a different confidence level, simply change the level argument. Below are the 99% confidence intervals, which are wider than the 95% intervals.

int99 <- confint(MDL, level = 0.99)
int99

##                   0.5 %     99.5 %
## (Intercept) -105.400380 -69.632954
## height         3.175472   3.724528
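These intervals follow the usual form of estimate plus or minus a critical t value times the standard error. A minimal sketch of that calculation is given here, assuming the model is refit as `fit`; `level`, `t_crit`, and `manual_ci` are illustrative names.

```r
# Refit the same model as MDL (assumed name: fit)
fit <- lm(weight ~ height, data = women)
coefs <- coef(summary(fit))

level <- 0.95
alpha <- 1 - level

# Critical value from the t distribution with the residual degrees of freedom
t_crit <- qt(1 - alpha / 2, df = df.residual(fit))

# estimate -/+ critical value * standard error, for both coefficients
lower <- coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"]
upper <- coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"]

manual_ci <- cbind(lower, upper)
manual_ci
```

The result should match `confint(fit, level = 0.95)`. Neither slope interval shown above contains zero, which is consistent with rejecting the null hypothesis that the slope is zero.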

Prediction Intervals

It’s important to note the distinction between confidence intervals and prediction intervals. A confidence interval is for a parameter, which is a fixed but unknown quantity, whereas a prediction interval is for a random variable. In the case of regression, the prediction interval is for a single new value of the response variable given a particular value of the explanatory variable.

The process of calculating prediction intervals is fairly straightforward.

predict(MDL, women, interval = "prediction")

##         fit      lwr      upr
## 1  112.5833 108.9122 116.2545
## 2  116.0333 112.4315 119.6352
## 3  119.4833 115.9412 123.0255
## 4  122.9333 119.4408 126.4259
## 5  126.3833 122.9298 129.8368
## 6  129.8333 126.4080 133.2587
## 7  133.2833 129.8750 136.6916
## 8  136.7333 133.3307 140.1360
## 9  140.1833 136.7750 143.5916
## 10 143.6333 140.2080 147.0587
## 11 147.0833 143.6298 150.5368
## 12 150.5333 147.0408 154.0259
## 13 153.9833 150.4412 157.5255
## 14 157.4333 153.8315 161.0352
## 15 160.8833 157.2122 164.5545

Now let’s look at the output. The first column indicates which observation from the original dataset was used for that specific interval. The women dataset contains height and weight data for 15 observations, so there are 15 intervals here. The fit column is the predicted weight from our linear regression model. In general, prediction intervals are centered at \[\hat{y}=\hat{\beta}_0 + \hat{\beta}_1 x_0\] where \(x_0\) is the value of the predictor at which the prediction is made (here, each observed height). The lower and upper bounds are in the lwr and upr columns, respectively, and are computed at the default 95% level. Looking at the first observation, the 95% prediction interval for the weight, in pounds, of a woman 58 inches tall is (108.9122, 116.2545).
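Prediction intervals are usually wanted for a new value of the predictor rather than for the observed ones. The sketch below assumes a hypothetical new height of 66.5 inches and a model refit as `fit`. It compares the prediction interval with the confidence interval for the mean response at that height, and then rebuilds the prediction interval from the standard textbook formula for simple linear regression.

```r
# Refit the same model as MDL (assumed name: fit)
fit <- lm(weight ~ height, data = women)

# Hypothetical new height, in inches (not from the original log)
x0 <- 66.5
new_obs <- data.frame(height = x0)

# Interval for one new woman's weight vs. interval for the mean weight
pi_new <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)
ci_new <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)
rbind(prediction = pi_new[1, ], confidence = ci_new[1, ])

# Manual prediction interval:
# fit -/+ t * s * sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx)
x <- women$height
n <- length(x)
s_resid <- summary(fit)$sigma            # residual standard error
s_xx <- sum((x - mean(x))^2)
se_pred <- s_resid * sqrt(1 + 1 / n + (x0 - mean(x))^2 / s_xx)
t_crit <- qt(0.975, df = df.residual(fit))

manual_pi <- c(lwr = pi_new[1, "fit"] - t_crit * se_pred,
               upr = pi_new[1, "fit"] + t_crit * se_pred)
manual_pi
```

Both intervals share the same center, but the prediction interval is wider because it also accounts for the variability of an individual observation around the regression line.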

Variability of the Response Variable

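The total variation in the response variable can be broken down into explained and unexplained variation. Explained variation is the part accounted for by the regression model, while unexplained variation is the part left over in the residuals. Our model summary will help us analyze the variability in the response variable.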

summary(MDL)

## 
## Call:
## lm(formula = weight ~ height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14
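The split into explained and unexplained variation can be computed directly from the fitted values and residuals. This is a minimal sketch, assuming the model is refit as `fit`; the names `sst`, `ssr`, and `sse` are illustrative labels for the total, explained, and unexplained sums of squares.

```r
# Refit the same model as MDL (assumed name: fit)
fit <- lm(weight ~ height, data = women)
y <- women$weight

# Total variation of the response around its mean
sst <- sum((y - mean(y))^2)

# Explained variation: fitted values around the mean of the response
ssr <- sum((fitted(fit) - mean(y))^2)

# Unexplained variation: squared residuals
sse <- sum(residuals(fit)^2)

c(total = sst, explained = ssr, unexplained = sse)

# Share of the total variation that is explained by the model
ssr / sst
```

The explained and unexplained parts should add up to the total, and the final ratio should reproduce the Multiple R-squared reported in the summary.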

The r-squared value, given as “Multiple R-squared” in the model summary, tells us the proportion of the variability in the response variable that is explained by its linear relationship with the predictor. In this example, r-squared is 0.991, so 99.1% of the variability in the weight of women is explained by the linear relationship with height.

r by itself is called the correlation, and it has the same sign as the slope. r is negative if an increase in the explanatory variable is associated with a decrease in the response (in other words, if the slope is negative). r is positive if an increase in the explanatory variable is associated with an increase in the response (the slope is positive). The square root of r-squared gives the magnitude of the correlation. Because the slope here is positive, the correlation is the positive square root.

sqrt(0.991)

## [1] 0.9954898

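Another way to calculate the correlation is to use cor. It works from the raw data, so it avoids the small rounding error introduced by using the rounded value 0.991 above.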

cor(height, weight)

## [1] 0.9954948

When using cor, the order of the two variables does not matter. If we switch the order of the variables, we should end up with the same answer.

cor(weight, height)

## [1] 0.9954948

Both answers are identical.
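Since the square root only gives the magnitude of the correlation, the sign has to be taken from the slope. The sketch below does this with the unrounded r-squared, assuming the model is refit as `fit`; `r_from_model` and `r_direct` are illustrative names.

```r
# Refit the same model as MDL (assumed name: fit)
fit <- lm(weight ~ height, data = women)

# Unrounded R-squared from the summary object
r_squared <- summary(fit)$r.squared

# Fitted slope, used only for its sign
slope <- coef(fit)[["height"]]

# Correlation recovered from the model: sign of slope times sqrt(R-squared)
r_from_model <- sign(slope) * sqrt(r_squared)

# Correlation computed directly from the data
r_direct <- cor(women$height, women$weight)

c(from_model = r_from_model, direct = r_direct)
```

With the unrounded r-squared, the two values should agree, unlike `sqrt(0.991)`, which differs from `cor` in the sixth decimal place.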

F-tests

The F-test is a test for a linear relationship between the predictor and response. The null hypothesis states that there is no linear relationship, while the alternative hypothesis states that there is a linear relationship. In simple linear regression, where there is only one predictor, the F-test is equivalent to the t-test for the slope of the regression line. Recall that the null hypothesis for the t-test is that the slope is equal to zero, which implies a lack of a linear relationship between the explanatory and response variables. The alternative hypothesis for the t-test states that the slope is not equal to zero, which implies there is some linear relationship between the two variables. In fact, the F statistic is the square of the t statistic for the slope: 37.85 squared is approximately 1433.

The information needed for the F-test can be found in the model summary.

summary(MDL)

## 
## Call:
## lm(formula = weight ~ height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

The last line, labeled F-statistic, gives all the information needed. 1433 is the value of the F-test statistic. 1 and 13 are the numerator and denominator degrees of freedom, respectively. Finally, the p-value for the F-test is given. Note that this p-value (1.091e-14) is the same as the p-value of the t-test for the slope, which the coefficient table displays rounded as 1.09e-14.
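The link between the F-test and the t-test for the slope can be verified from the summary object. This is a minimal sketch, assuming the model is refit as `fit`; `model_summary` and the other variable names are illustrative.

```r
# Refit the same model as MDL (assumed name: fit)
fit <- lm(weight ~ height, data = women)
model_summary <- summary(fit)

# F statistic and its numerator and denominator degrees of freedom
f_stat <- model_summary$fstatistic[["value"]]
df_num <- model_summary$fstatistic[["numdf"]]
df_den <- model_summary$fstatistic[["dendf"]]

# Upper-tail p-value from the F distribution
p_f <- pf(f_stat, df_num, df_den, lower.tail = FALSE)

# t statistic and p-value for the slope, for comparison
t_slope <- coef(model_summary)["height", "t value"]
p_t <- coef(model_summary)["height", "Pr(>|t|)"]

c(F = f_stat, t_squared = t_slope^2, p_F = p_f, p_t = p_t)
```

The equality of F and the squared t statistic holds here because the model has a single predictor. With several predictors, the F-test assesses all slopes jointly and no longer matches any single t-test.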