This analysis looks at the relationship between Height and Weight. The independent variable is Height and the dependent variable is Weight.
data <- read.csv(file="C:/Users/stina/Desktop/Fall 2018/Data 605/hw11/weight-height.csv", header=TRUE, sep=",")
head(data, 20)
##    Gender   Height   Weight
## 1    Male 73.84702 241.8936
## 2    Male 68.78190 162.3105
## 3    Male 74.11011 212.7409
## 4    Male 71.73098 220.0425
## 5    Male 69.88180 206.3498
## 6    Male 67.25302 152.2122
## 7    Male 68.78508 183.9279
## 8    Male 68.34852 167.9711
## 9    Male 67.01895 175.9294
## 10   Male 63.45649 156.3997
## 11   Male 71.19538 186.6049
## 12   Male 71.64081 213.7412
## 13   Male 64.76633 167.1275
## 14   Male 69.28307 189.4462
## 15   Male 69.24373 186.4342
## 16   Male 67.64562 172.1869
## 17   Male 72.41832 196.0285
## 18   Male 63.97433 172.8835
## 19   Male 69.64006 185.9840
## 20   Male 67.93600 182.4266
Weight and Height
The scatterplot shows that weight increases as height increases.
plot(data$Height, data$Weight, xlab = "Height", ylab = "Weight")
Create a one-factor linear model.
The intercept is -350.73719 and the slope is 7.71729.
height_weight.lm <- lm(Weight ~ Height, data = data)
summary(height_weight.lm)
##
## Call:
## lm(formula = Weight ~ Height, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -51.934  -8.236  -0.119   8.260  46.844
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -350.73719    2.11149  -166.1   <2e-16 ***
## Height         7.71729    0.03176   243.0   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.22 on 9998 degrees of freedom
## Multiple R-squared: 0.8552, Adjusted R-squared: 0.8552
## F-statistic: 5.904e+04 on 1 and 9998 DF, p-value: < 2.2e-16
plot(data$Height, data$Weight, xlab = "Height", ylab = "Weight")
abline(height_weight.lm)
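As a rough illustration of what the fitted line implies, the reported coefficients can be plugged into Weight = intercept + slope × Height. The sketch below hard-codes the two estimates from the summary output. `predict_weight` is a hypothetical helper name, and the heights passed to it are arbitrary example values in the same units as the Height column.

```r
# Coefficients copied from the summary(height_weight.lm) output.
intercept <- -350.73719
slope <- 7.71729

# Hypothetical helper: predicted Weight for a given Height,
# using Weight = intercept + slope * Height.
predict_weight <- function(height) {
  intercept + slope * height
}

# Arbitrary example heights (assumed values, chosen for illustration only).
example_heights <- c(65, 70, 75)
predicted <- predict_weight(example_heights)
print(data.frame(Height = example_heights, Predicted_Weight = predicted))
```

With the model object in the session, `predict(height_weight.lm, newdata = data.frame(Height = c(65, 70, 75)))` should give the same values up to rounding of the printed coefficients.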
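The residuals appear to be roughly symmetric around zero: the median is -0.119, the first and third quartiles (-8.236 and 8.260) are nearly equal in magnitude, and so are the minimum and maximum (-51.934 and 46.844). This is consistent with normally distributed residuals.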
The standard error of the Height coefficient (0.03176) is 7.71729/0.03176 = 242.9877 times smaller than the coefficient itself (7.71729). This large ratio, which is the t value reported in the coefficient table, suggests that there is relatively little variability in the slope estimate.
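The ratio of an estimate to its standard error can also be turned into an approximate confidence interval for the slope. The following is a minimal sketch that uses only the numbers printed in the summary output (estimate, standard error, residual degrees of freedom). The variable names are chosen here for illustration.

```r
# Values copied from the coefficient table for Height.
estimate  <- 7.71729
std_error <- 0.03176
df_resid  <- 9998   # residual degrees of freedom from the summary

# Ratio of estimate to standard error (the t value in the table).
t_value <- estimate / std_error

# 95% confidence interval: estimate +/- critical t * standard error.
t_crit <- qt(0.975, df = df_resid)
ci_95  <- estimate + c(-1, 1) * t_crit * std_error

print(round(t_value, 1))
print(round(ci_95, 3))
```

With the model object available, `confint(height_weight.lm)` computes the same kind of interval directly from the fit.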
The p-values for both the intercept and the Height coefficient are <2e-16 (very small). If the true coefficients were zero, estimates this far from zero would be extremely unlikely, so both terms are statistically significant in the model.
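The `<2e-16` in the coefficient table is best read as a display floor rather than an exact p-value: R's coefficient printing reports p-values below machine epsilon (about 2.2e-16) in that form. The sketch below recomputes the two-sided p-values from the printed t values and residual degrees of freedom. The vector names are illustrative, and the t values are the rounded ones from the table.

```r
# t values and residual degrees of freedom copied from the summary output.
t_values <- c(Intercept = -166.1, Height = 243.0)
df_resid <- 9998

# Two-sided p-value: probability of a |t| at least this large
# if the true coefficient were zero.
p_values <- 2 * pt(-abs(t_values), df = df_resid)

# Both fall below machine epsilon, which is why they print as "<2e-16".
print(p_values < .Machine$double.eps)
print(format.pval(p_values, eps = .Machine$double.eps))
```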
The multiple R-squared is 0.8552, which means that the model explains 85.52% of the variability in Weight.
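For a model with a single predictor, the figures in the summary are tied together: R-squared equals F / (F + residual df), the F-statistic equals the square of the slope's t value, and the square root of R-squared is the absolute correlation between Height and Weight. A small consistency check, using only the rounded numbers printed in the output, might look like this (variable names are illustrative):

```r
# Values copied from the summary output (rounded as printed).
f_stat   <- 5.904e4   # F-statistic
t_height <- 243.0     # t value for Height
df_resid <- 9998      # residual degrees of freedom

# R-squared recovered from the F-statistic (one-predictor case).
r_squared_from_f <- f_stat / (f_stat + df_resid)

# Correlation between Height and Weight; positive because the slope is positive.
correlation <- sqrt(r_squared_from_f)

print(round(r_squared_from_f, 4))
print(round(correlation, 4))
print(t_height^2)   # should be close to the F-statistic
```

Small differences between `t_height^2` and the printed F-statistic come from rounding in the summary output.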
From the book:
“A model that fits the data well would tend to over-predict as often as it under-predicts. Thus, if we plot the residual values, we would expect to see them distributed uniformly around zero for a well-fitted model.”
The plot below shows that the residuals appear to be uniformly scattered above and below zero.
plot(fitted(height_weight.lm), resid(height_weight.lm))
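The idea that a well-fitted model over-predicts about as often as it under-predicts can also be checked numerically. The sketch below defines a hypothetical helper, `residual_balance`, that reports the share of positive residuals along with their mean and median. Because the CSV file is not available here, the demonstration uses R's built-in `women` data set as a stand-in, so the printed numbers do not describe the height and weight model in this analysis.

```r
# Hypothetical helper: summarizes how residuals are balanced around zero.
residual_balance <- function(model) {
  r <- resid(model)
  c(share_positive  = mean(r > 0),   # fraction of under-predictions
    mean_residual   = mean(r),       # ~0 by construction for OLS with an intercept
    median_residual = median(r))     # near 0 if residuals are symmetric
}

# Stand-in model on the built-in `women` data (columns: height, weight).
# This is NOT the data used in the analysis; it only makes the sketch runnable.
stand_in.lm <- lm(weight ~ height, data = women)
print(residual_balance(stand_in.lm))
```

For the model in this analysis, the call would be `residual_balance(height_weight.lm)`. A share of positive residuals near 0.5 and a median near zero would support the reading of the plot.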
The Q-Q plot suggests that residuals are normally distributed.
qqnorm(resid(height_weight.lm))
qqline(resid(height_weight.lm))