In class on Tuesday, we reviewed linear regression and learned how to create regression models in R.
In this learning log, I will use the “women” dataset and demonstrate my knowledge of the R commands regarding linear regression.
For this document, I assume that I have two quantitative variables. I will plot the data and interpret the scatterplot, calculate correlation, fit the regression line, interpret it, use it for prediction, and calculate residuals.
I am attaching the data set so that I don’t have to refer to it by name every time.
attach(women)
I am interested in the relationship between two variables: Height and Weight. Height is the explanatory variable and Weight is the response.
The following code plots weight (response, y-axis) against height (explanatory, x-axis) and creates a scatterplot with appropriate labels.
plot(height, weight, xlab = "Women Height (in)", ylab = "Women Weight (lbs)", main = "Women Height and Weight Scatterplot")
There appears to be a strong, positive, linear relationship between women’s height and weight, so fitting a linear regression and obtaining a line of best fit is a reasonable next step.
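The strength of the linear relationship seen in the scatterplot can be quantified with the correlation coefficient. The lines below are a minimal sketch using `cor()` on the built-in `women` data; the name `r_hw` is chosen here for illustration only.

```r
# Pearson correlation between height and weight in the built-in women data.
# `r_hw` is a hypothetical name used only in this sketch.
r_hw <- cor(women$height, women$weight)
r_hw

# With a single explanatory variable, the squared correlation equals the
# Multiple R-squared that summary() reports for the fitted model.
r_hw^2
```

A value close to 1 supports the reading of a strong positive linear relationship.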
I will now fit the linear model, store it, and print it.
LearningLog3 <- lm(weight ~ height)
LearningLog3
##
## Call:
## lm(formula = weight ~ height)
##
## Coefficients:
## (Intercept) height
## -87.52 3.45
The output gives the equation of the line of best fit: y = 3.45x - 87.52, where x is height in inches and y is predicted weight in pounds. For every additional inch of height, we expect a woman to weigh 3.45 pounds more on average. However, this is one of those cases where the y-intercept has no logical interpretation: it corresponds to a height of 0 inches, and it is impossible for someone to have negative weight. Predictions should only be made within the range of the data, which in this case is heights between 58 and 72 inches.
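The slope, the intercept, and the range of heights mentioned above can also be pulled out of R rather than read off the printed output. This is a sketch that refits the same model with `data = women`, so it does not rely on `attach()`; `fit_sketch` and `b` are names made up for this example.

```r
# Refit the same model; `fit_sketch` and `b` are hypothetical names.
fit_sketch <- lm(weight ~ height, data = women)

# coef() returns a named vector with the intercept and the slope.
b <- coef(fit_sketch)
b["(Intercept)"]  # intercept of the fitted line
b["height"]       # slope: expected change in weight per extra inch

# Smallest and largest observed heights, i.e. the range over which
# predictions from this line are reasonable.
range(women$height)
```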
To add the line of best fit to the scatterplot, I use the command abline() and pass it the intercept and the slope from the model output.
plot(height, weight, xlab = "Women Height (in)", ylab = "Women Weight (lbs)", main = "Women Height and Weight Scatterplot")
abline(-87.52, 3.45)
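As an alternative to typing the rounded coefficients by hand, `abline()` also accepts a fitted model object and reads the intercept and slope from it. The sketch below assumes the same model is refit under the hypothetical name `fit_line`.

```r
# Refit the model; `fit_line` is a hypothetical name for this sketch.
fit_line <- lm(weight ~ height, data = women)

# Redraw the scatterplot, then let abline() take the unrounded
# intercept and slope directly from the model object.
plot(women$height, women$weight,
     xlab = "Women Height (in)", ylab = "Women Weight (lbs)",
     main = "Women Height and Weight Scatterplot")
abline(fit_line)
```

This avoids rounding error in the plotted line and keeps the line in sync if the model is refit.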
I want to predict the weight of a 63-inch-tall woman and compare it with the observed weight, so I need to know which row contains height 63.
which(height==63)
## [1] 6
RealW = women[6,"weight"]
PredW = coef(LearningLog3)%*%c(1, 63)
RealW
## [1] 129
PredW
## [,1]
## [1,] 129.8333
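The matrix product above can also be written with `predict()`, which takes new heights in a data frame and works for heights that do not appear in the data. This is a sketch under the assumption that the same model is refit as `fit_pred`; that name, `new_heights`, and `pred` are illustrative only.

```r
# Refit the model; `fit_pred`, `new_heights`, and `pred` are hypothetical names.
fit_pred <- lm(weight ~ height, data = women)

# The column name must match the explanatory variable in the formula.
# 63 is in the data; 63.5 is not, but it lies inside the observed range.
new_heights <- data.frame(height = c(63, 63.5))

# Predicted weights, returned as a plain numeric vector instead of a 1x1 matrix.
pred <- predict(fit_pred, newdata = new_heights)
pred
```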
The residual is the difference between the actual and predicted values.
ResidW = RealW - PredW
ResidW
## [,1]
## [1,] -0.8333333
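The residual for height 63 is -0.83. To examine the residuals as a whole, I create a normal Q-Q plot to check whether they are roughly normal, and then plot the residuals against height to check for patterns.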
LL3r <- LearningLog3$residuals
qqnorm(LL3r)
qqline(LL3r)
plot(LL3r ~ height)
abline(0,0)
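The residuals behind these two plots can also be listed directly, which makes it easier to check individual values. The sketch below refits the model under the hypothetical name `fit_res` and uses `residuals()`, the extractor function equivalent to `$residuals`.

```r
# Refit the model; `fit_res` and `res` are hypothetical names for this sketch.
fit_res <- lm(weight ~ height, data = women)

# One residual (observed minus predicted weight) per row of the data.
res <- residuals(fit_res)
round(res, 2)

# The residual for the woman who is 63 inches tall.
res[women$height == 63]

# Least-squares residuals from a model with an intercept sum to zero,
# up to floating-point rounding.
sum(res)
```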
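The residuals are not scattered randomly around zero: they are positive at the shortest and tallest heights and negative in between. I don’t see evidence of homoscedasticity, and the curved pattern suggests that a straight line misses part of the relationship.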
I expect to see a large Mean Square Error. I can check this value by viewing a summary of my linear regression model.
summary(LearningLog3)
##
## Call:
## lm(formula = weight ~ height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
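The summary does not print the MSE directly. It reports the Residual Standard Error, which is the square root of the MSE: 1.525 on 13 degrees of freedom (15 observations minus the 2 estimated coefficients). The MSE is therefore about 1.525^2 ≈ 2.33.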
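The relationship between the mean square error and the residual standard error can be checked by hand. This is a minimal sketch, assuming the same model is refit as `fit_mse`; the names `sse`, `df_res`, `mse`, and `rse` are introduced here for illustration.

```r
# Refit the model; all names below are hypothetical and used only in this sketch.
fit_mse <- lm(weight ~ height, data = women)

# Sum of squared residuals.
sse <- sum(residuals(fit_mse)^2)

# Residual degrees of freedom: 15 observations minus 2 estimated coefficients.
df_res <- df.residual(fit_mse)

# Mean square error and its square root, the residual standard error.
mse <- sse / df_res
rse <- sqrt(mse)
c(MSE = mse, RSE = rse)

# The residual standard error stored in the model summary, for comparison.
summary(fit_mse)$sigma
```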