For this learning log, I chose the “women” data set. This data set gives the average heights and weights for American women aged 30-39.

data(women)
attach(women)

Scatterplots

I am interested in the relationship between weight and {height}. {height} is recorded in inches, while weight is recorded in pounds.

I will use {height} as the explanatory variable and weight as the response, due to the fact that it’s generally easier to change weights than change height.

plot(height,weight,ylab="Weight (lbs)",xlab="Height (in)")

It can be seen from the scatterplot that there is a positive association between the two variables. The points seem to follow the trend pretty closely, and there seems to be an obvious linear relationship.

Creating the Linear Regression Model

mymod<-lm(weight~height)
mymod

## 
## Call:
## lm(formula = weight ~ height)
## 
## Coefficients:
## (Intercept)       height  
##      -87.52         3.45

The regression line is \[\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i\] Plugging in our numbers, \[\hat{y_i}= -87.52 + 3.45 x_i\]

Interpreting the Linear Regression

For every inch increase in height, the average American woman’s weight is increased on average by 3.45 pounds.

We cannot interpret the intercept because we cannot reasonably set the explanatory variable to zero. This makes sense, as there’s no possible way that a woman’s height can be zero.

Plotting the Regression Line

plot(height,weight,ylab="Weight (lbs)",xlab="Height (in)")
abline(-87.52,3.45)

Prediction

Let’s predict the first record.

firstheight<-women[1,1]
firstactual<-women[1,2]
firstpredicted<-coef(mymod)%*%c(1,firstheight)
firstpredicted

##          [,1]
## [1,] 112.5833

The prediction for the weight of an American woman with height 58 inches is 112.5833 pounds.

Residual Calculation

firstresid<-firstactual-firstpredicted
firstresid

##          [,1]
## [1,] 2.416667

So our prediction was 2.4167 pounds too low.

Checking Model Assumptions

myresids<-mymod$residuals
hist(myresids)

This histogram doesn’t seem to be normally distributed.

qqnorm(myresids)
qqline(myresids)

If the residuals were normally distributed, they’d follow a straight line. The points near the end of the line seem to be pretty far off. I’d say our assumption of normal distribution isn’t fulfilled.

Let’s look at the original plot again to check for homoscedasticity.

plot(height,weight,ylab="Weight (lbs)",xlab="Height (in)")

Although it seems to follow a straight line, we don’t have much data around each height.

Let’s look at the residuals plotted against the height.

plot(mymod$residuals ~ height)
abline(0,0)

It seems to me that there is an obvious curve, which points to heterscedasticity instead of homoscedasity, so the variance doesn’t seem to be constant.

Mean Square Error

summary(mymod)

## 
## Call:
## lm(formula = weight ~ height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

So the MSE is 1.525, with 13 degrees of freedom because there are 15 observations, minus 2 for the two terms with Beta.

Learning Log 3

Madison Bemis

February 6, 2018