For today’s learning log, I will begin by discussing confidence intervals for regression coefficients, along with confidence and prediction intervals for the response.
Recall the equation for simple linear regression: \[ Y_i= {\beta_0} + {\beta_1} x_i + \varepsilon_i\] where \[\varepsilon_i\] is the random error term.
For our fitted model, we’ll use the following equation, estimated from our specific sample data: \[\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i\] where \[\hat{y_i}\] is our point estimate of the response variable, \[\hat{\beta_0}\] is our point estimate of the Y-intercept of the model, \[\hat{\beta_1}\] is our point estimate of the slope of the linear model, and \[ x_i \] is a given value of X.
Since these model parameters are point estimates, we are interested in the range of plausible values for the true parameters, given that a different sample would produce different estimates. We will use the built-in “women” dataset to illustrate interval estimation for these parameters.
data(women)
attach(women)
summary(women)
## height weight
## Min. :58.0 Min. :115.0
## 1st Qu.:61.5 1st Qu.:124.5
## Median :65.0 Median :135.0
## Mean :65.0 Mean :136.7
## 3rd Qu.:68.5 3rd Qu.:148.0
## Max. :72.0 Max. :164.0
str(women)
## 'data.frame': 15 obs. of 2 variables:
## $ height: num 58 59 60 61 62 63 64 65 66 67 ...
## $ weight: num 115 117 120 123 126 129 132 135 139 142 ...
We see that the dataset includes the heights (in inches) and weights (in pounds) of 15 women. We’ll use this data to build confidence intervals for the regression coefficients, and confidence and prediction intervals for the response.
WomenMod <- lm(weight ~ height)
WomenMod
##
## Call:
## lm(formula = weight ~ height)
##
## Coefficients:
## (Intercept) height
## -87.52 3.45
So our linear model now looks like this: \[\hat{weight}= {-87.52} + {3.45} height \]
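To see where these two numbers come from, here is a minimal sketch that recomputes them with the standard least-squares formulas (slope = SSxy/SSxx, intercept = ybar minus slope times xbar). The variable names `SSxx`, `SSxy`, `b0` and `b1` are my own labels, not something produced by `lm()`.

```r
# Least-squares estimates computed directly from the sums of squares.
x <- women$height
y <- women$weight

SSxx <- sum((x - mean(x))^2)                 # variation in the predictor
SSxy <- sum((x - mean(x)) * (y - mean(y)))   # co-variation of x and y

b1 <- SSxy / SSxx               # slope estimate
b0 <- mean(y) - b1 * mean(x)    # intercept estimate

c(intercept = b0, slope = b1)
```

The result should agree with the coefficients printed by `lm()` above, up to rounding.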
plot(height, weight)
abline(-87.52,3.45)
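We know that, under the usual model assumptions: 1. Our slope estimate \[\hat{\beta_1}\] follows a normal distribution with mean \[\beta_1\] and standard deviation estimated by \[ s/\sqrt{SSxx} \]. 2. Our intercept estimate \[\hat{\beta_0}\] follows a normal distribution with mean \[\beta_0\] and standard deviation estimated by \[ s\sqrt{1/n + \bar{x}^2/SSxx} \]. Here s is the residual standard error. Because s is itself estimated from the data, the intervals use the t distribution with n - 2 degrees of freedom.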
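As a quick check on these two standard-error formulas, the sketch below computes them by hand and prints the values the fitted model reports. It refits the model with an explicit `data = women` argument so it runs on its own; `fit`, `se_b0` and `se_b1` are names I made up for this illustration.

```r
# Standard errors of the intercept and slope, computed from the formulas.
fit <- lm(weight ~ height, data = women)
x <- women$height
n <- length(x)

SSxx <- sum((x - mean(x))^2)
s <- sigma(fit)                               # residual standard error

se_b1 <- s / sqrt(SSxx)                       # slope
se_b0 <- s * sqrt(1 / n + mean(x)^2 / SSxx)   # intercept

c(se_b0 = se_b0, se_b1 = se_b1)

# For comparison: the "Std. Error" column of the model summary
coef(summary(fit))[, "Std. Error"]
```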
We can use the confint() function to produce a confidence interval at a given confidence level. For this example, we’ll use the standard 95%.
confint(WomenMod, level=.95)
## 2.5 % 97.5 %
## (Intercept) -100.342655 -74.690679
## height 3.253112 3.646888
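Our code output tells us that we are 95% confident the true slope lies between 3.253 and 3.647: for each 1 inch increase in height, mean weight increases by somewhere between 3.253 and 3.647 pounds. Likewise, we are 95% confident the true intercept lies between -100.3 and -74.7. The intercept interval is of little practical use here, since it corresponds to a height of 0 inches, far outside the range of the data.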
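The interval that confint() reports is just estimate ± t × standard error. Here is a hedged sketch of that calculation, assuming the same model refit as `fit`; `t_crit` and `manual_ci` are illustrative names of my own.

```r
# 95% confidence intervals for the coefficients, built by hand.
fit <- lm(weight ~ height, data = women)

est <- coef(fit)
se <- coef(summary(fit))[, "Std. Error"]

# Two-sided 95% critical value with n - 2 residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

manual_ci <- cbind(lower = est - t_crit * se,
                   upper = est + t_crit * se)
manual_ci
```

The rows should match the confint() output above.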
In the plot, the blue lines use the endpoints of the slope interval (with the fitted intercept), while the red lines use the endpoints of the intercept interval (with the fitted slope).
plot(height, weight)
abline(-87.52, 3.253112, col = 'blue')
abline(-87.52,3.45)
abline(-87.52, 3.646888, col = 'blue')
abline(-100.3,3.45, col = 'red')
abline(-74.7,3.45, col = 'red')
\[\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i\]
Now, we are interested in examining the margin of error for the estimated mean response \[\hat{y}\] at a given value \[x_0\],
which follows \[N(\mu_{y|x_0},\ s\sqrt{Distance~Value})\] (written here as mean and standard error), where \[Distance~Value = 1/n + (x_0 - \bar{x})^2/SSxx \]
For a given x0, we are interested in knowing what the mean of y will be based on our linear model. In this example, we’ll find the confidence interval for the mean weight of women who are 72 inches tall.
newdata <- data.frame(height=72)
confidentint <- predict(WomenMod, newdata, interval="confidence", level = .95)
confidentint
## fit lwr upr
## 1 160.8833 159.2637 162.5029
From our R output, we are 95% confident that the mean weight of women who are 72 inches tall is between 159.26 and 162.50 pounds.
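To connect this output to the distance value, the sketch below rebuilds the interval by hand for a height of 72 inches. It is only an illustration under the same model; `x0`, `dist_val`, `margin` and `ci_mean` are names I chose.

```r
# Confidence interval for the mean response at x0 = 72, built by hand.
fit <- lm(weight ~ height, data = women)
x <- women$height
n <- length(x)
x0 <- 72

SSxx <- sum((x - mean(x))^2)
dist_val <- 1 / n + (x0 - mean(x))^2 / SSxx   # distance value

y_hat <- sum(coef(fit) * c(1, x0))            # fitted value at x0
margin <- qt(0.975, df.residual(fit)) * sigma(fit) * sqrt(dist_val)

ci_mean <- c(fit = y_hat, lwr = y_hat - margin, upr = y_hat + margin)
ci_mean
```

Note that the distance value grows as x0 moves away from the mean height, so the interval is narrowest at 65 inches and widens toward the edges of the data.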
If we are instead interested in predicting a single response value \[y\] at \[x_0\], we can look at the prediction error \[y-\hat{y}\], which follows \[N(0,\ s\sqrt{1+Distance~Value})\] where \[Distance~Value = 1/n + (x_0 - \bar{x})^2/SSxx \]
This distribution includes more variation, since we are predicting a single y from a single x, instead of a mean y. We will again choose x0 = 72 inches.
WomenMod <- lm(weight ~ height)
newdata <- data.frame(height=72)
predictint <- predict(WomenMod, newdata, interval="prediction", level = .95)
predictint
## fit lwr upr
## 1 160.8833 157.2122 164.5545
From our R output, we are 95% confident that the weight of an individual woman who is 72 inches tall falls between 157.212 and 164.555 pounds.
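The prediction interval is a little more than twice as wide as the confidence interval for the mean (about 7.34 versus 3.24 pounds), which matches our intuition that an individual is harder to predict than an average.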
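The only difference between the two intervals is the extra 1 under the square root. A small sketch, again with names of my own (`pi_single`, `width_ratio`), rebuilds the prediction interval by hand and compares the two widths:

```r
# Prediction interval for a single response at x0 = 72, built by hand.
fit <- lm(weight ~ height, data = women)
x <- women$height
n <- length(x)
x0 <- 72

SSxx <- sum((x - mean(x))^2)
dist_val <- 1 / n + (x0 - mean(x))^2 / SSxx

y_hat <- sum(coef(fit) * c(1, x0))
t_crit <- qt(0.975, df.residual(fit))

margin_mean <- t_crit * sigma(fit) * sqrt(dist_val)       # mean response
margin_pred <- t_crit * sigma(fit) * sqrt(1 + dist_val)   # single response

pi_single <- c(fit = y_hat, lwr = y_hat - margin_pred, upr = y_hat + margin_pred)
pi_single

# How much wider the prediction interval is than the confidence interval
width_ratio <- margin_pred / margin_mean
width_ratio
```

Because both margins share the same t value and s, the ratio depends only on the distance value.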
Both the prediction and confidence interval should have the same center point. Let’s verify.
predictint[1] == confidentint[1]
## [1] TRUE
Jolly good
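One small caution on that check: comparing floating-point numbers with == happens to work here, but in general all.equal() is the safer habit because it allows for tiny rounding differences. A sketch, with `ci_72` and `pi_72` as stand-in names for the two predict() results:

```r
# Tolerance-based comparison of the two interval centers.
fit <- lm(weight ~ height, data = women)
new_height <- data.frame(height = 72)

ci_72 <- predict(fit, new_height, interval = "confidence", level = 0.95)
pi_72 <- predict(fit, new_height, interval = "prediction", level = 0.95)

# isTRUE() is needed because all.equal() returns a message, not FALSE, on mismatch
same_center <- isTRUE(all.equal(ci_72[, "fit"], pi_72[, "fit"]))
same_center
```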
The end of chapter 3 discusses correlation in terms of R-squared and r.
R-Squared is known as the simple coefficient of determination. When comparing two variables, R-squared represents the proportion of total variation in the response variable that is explained by the simple linear regression model. The closer R-squared is to 1, the better the predictor variable explains the response variable.
\[ r^2 = (Explained~ Variation/Total~ Variation) \] or
\[ r^2 = (\sum(\hat{y_i} - \bar{y})^2)/(\sum(y_i - \bar{y})^2)\]
summary(WomenMod)
##
## Call:
## lm(formula = weight ~ height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
At the bottom of the summary output, we see the multiple R-Squared as 0.991. This means that 99.1 % of the variation in weights can be explained by our linear model using heights. R-Squared can range from 0 to 1.
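The R-squared formula can be checked directly against this summary value. A minimal sketch, where `explained`, `total` and `r_squared` are names I picked for the illustration:

```r
# R-squared as explained variation over total variation.
fit <- lm(weight ~ height, data = women)
y <- women$weight

explained <- sum((fitted(fit) - mean(y))^2)   # variation captured by the model
total <- sum((y - mean(y))^2)                 # total variation in weight

r_squared <- explained / total
r_squared

# For comparison: the value stored in the model summary
summary(fit)$r.squared
```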
Further, we can measure the correlation coefficient, r, as well. r measures the strength of the linear relationship between the response and predictor variables.
As logic dictates \[ r = \pm\sqrt{r^2} \] where r takes the sign of the slope \[\hat{\beta_1}\].
r can range from -1 to 1, with -1 indicating a perfect negative linear relationship and 1 a perfect positive linear relationship. Values near 0 indicate little or no linear relationship.
cor(height, weight)
## [1] 0.9954948
cor(weight, height)
## [1] 0.9954948
As far as R code goes, the two variables can be given in either order, since correlation is symmetric. With a correlation coefficient of 0.9955, we can say that there is a strong positive linear relationship between women’s height and weight, i.e. as height increases, weight tends to increase also.
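The link between r and R-squared can be verified in a couple of lines. This is a sketch for simple linear regression only (one predictor); `r_from_fit` and `r_direct` are my own names.

```r
# Recover r from R-squared, taking the sign from the fitted slope.
fit <- lm(weight ~ height, data = women)

r_from_fit <- sign(coef(fit)[["height"]]) * sqrt(summary(fit)$r.squared)
r_direct <- cor(women$height, women$weight)

c(from_fit = r_from_fit, direct = r_direct)
```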
Correlation never proves causation, though.