Learning Log_2/6/2018

We will look at the relationship between women’s height and weight, specifically whether height explains weight, found in the data set “women”

attach(women)

First, we plot the height and the weight on a scatterplot to get a general idea of the data.

plot(height, weight, ylab = "Weight in Pounds", xlab = "Height in Inches")

The data appears to be fairly linear, with a positive association. The points seem to follow the trend closely. Since it seems likely that the data is linear, we create a linear regression model:

mymod <- lm(weight ~ height)
mymod

## 
## Call:
## lm(formula = weight ~ height)
## 
## Coefficients:
## (Intercept)       height  
##      -87.52         3.45

So, our regression equation is weight= -87.52 + 3.54*height + e.

Since we do not have data near 0, the intercept is meaningless. However, we can interpret the slope: For ever increase in height by one inch, on average, weight increases by 3.45 inches.

Next, we can plot the regression line with the data to confirm that it looks reasonable.

plot(height, weight ,ylab="Weight in Pounds",xlab="Height in Inches")
abline(-87.52, 3.45)

Using the regression equation we found, we can predict a weight given a height. Lets look at the first data point in the data set, with height:

women[1,1]

## [1] 58

The actual weight for this height is:

women[1,2]

## [1] 115

The weight predicted using the height and the regression equation we found is:

coef(mymod)%*%c(1, women[1,1])

##          [,1]
## [1,] 112.5833

Now that we have the actual height and the predicted height, we can find the residual.

weight_actual <- women[1, 2]
height_actual <- women[1, 1]
weight_predicted <- coef(mymod)%*%c(1, height_actual)
resid <- weight_actual - weight_predicted
resid

##          [,1]
## [1,] 2.416667

We can find the residual for each point. We make a histogram of the residuals to see if they are approximately normal. In the histogram, it looks like the points are not normal.

myresids <- mymod$residuals
hist(myresids)

We can also plot the quantiles of the resuduals against the quantiles of a normal distribution to see if they line up.

qqnorm(myresids)
qqline(myresids)

This plot also seems to indicate that the residuals are not approximately normal.

We can also check the assumption of homoscedacticity. We plot the residuals against height to see if the residuals are evenly spread out.

plot(mymod$residuals ~ women$height)
abline(0,0)

The pattern of the points shows that the variance is not equal throughout the data.

Lastly we can find the mean square error by looking at the summary of the model.

summary(mymod)

## 
## Call:
## lm(formula = weight ~ height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

This summary shows that the mean square error is 1.525 with 13 degrees of freedom, since we have 15 data points and 2 variables.

Learning Log_2/6/2018

Davita Blyakher

February 6, 2018