About Data

Data: Height and Weight

Source: https://www.kaggle.com/mustafaali96/weight-height

One Factor Linear Regression

This analysis looks at the relationship between Height and Weight. The independent variable is Height and the dependent variable is Weight.

Load Data

data <- read.csv(file="C:/Users/stina/Desktop/Fall 2018/Data 605/hw11/weight-height.csv", header=TRUE, sep=",")
head(data, 20)
##    Gender   Height   Weight
## 1    Male 73.84702 241.8936
## 2    Male 68.78190 162.3105
## 3    Male 74.11011 212.7409
## 4    Male 71.73098 220.0425
## 5    Male 69.88180 206.3498
## 6    Male 67.25302 152.2122
## 7    Male 68.78508 183.9279
## 8    Male 68.34852 167.9711
## 9    Male 67.01895 175.9294
## 10   Male 63.45649 156.3997
## 11   Male 71.19538 186.6049
## 12   Male 71.64081 213.7412
## 13   Male 64.76633 167.1275
## 14   Male 69.28307 189.4462
## 15   Male 69.24373 186.4342
## 16   Male 67.64562 172.1869
## 17   Male 72.41832 196.0285
## 18   Male 63.97433 172.8835
## 19   Male 69.64006 185.9840
## 20   Male 67.93600 182.4266

Plot Weight and Height

The scatter plot shows that weight tends to increase as height increases.

plot(data$Height, data$Weight, xlab = "Height", ylab = "Weight")

Linear Model

Create a one-factor linear model.

Intercept is -350.73719. Slope is 7.71729.

Weight = -350.73719 + 7.71729 * Height

height_weight.lm <- lm(Weight ~ Height, data = data)
summary(height_weight.lm)
## 
## Call:
## lm(formula = Weight ~ Height, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.934  -8.236  -0.119   8.260  46.844 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -350.73719    2.11149  -166.1   <2e-16 ***
## Height         7.71729    0.03176   243.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.22 on 9998 degrees of freedom
## Multiple R-squared:  0.8552, Adjusted R-squared:  0.8552 
## F-statistic: 5.904e+04 on 1 and 9998 DF,  p-value: < 2.2e-16
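
As a quick illustration of how the fitted equation is used, the sketch below plugs a height into the reported coefficients. The helper name `predict_weight` and the example height of 70 are illustrative choices, not part of the original analysis; the result is in the data set's own units.

```r
# Coefficients copied from the summary() output above
intercept <- -350.73719
slope     <- 7.71729

# Hypothetical helper: predicted Weight for a given Height
predict_weight <- function(height) {
  intercept + slope * height
}

predict_weight(70)
# [1] 189.4731
```

With the fitted model object available, `predict(height_weight.lm, newdata = data.frame(Height = 70))` should give the same value up to rounding of the printed coefficients. The equation is only meaningful for heights within the range covered by the data.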

Plot of Linear Model

plot(data$Height, data$Weight, xlab = "Height", ylab = "Weight")
abline(height_weight.lm)

Summary of Linear Model

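The residuals appear roughly symmetric around zero: the median is -0.119, and the first and third quartiles (-8.236 and 8.260) are nearly equal in magnitude. This is consistent with normally distributed residuals.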

The standard error of the Height coefficient (0.03176) is about 243 times smaller than the coefficient itself (7.71729 / 0.03176 = 242.9877). This large ratio suggests that there is relatively little variability in the slope estimate.
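
The ratio quoted above is the t value reported in the coefficient table. A minimal sketch, using only the numbers printed by `summary()`, reproduces it and also derives an approximate 95% confidence interval for the slope. The variable names are illustrative.

```r
# Values copied from the coefficient table
slope_est <- 7.71729
slope_se  <- 0.03176
df_resid  <- 9998   # residual degrees of freedom

# t value = estimate / standard error (about 243, as in the table)
t_value <- slope_est / slope_se
t_value

# Approximate 95% confidence interval: estimate +/- critical t * standard error
t_crit   <- qt(0.975, df = df_resid)
slope_ci <- slope_est + c(-1, 1) * t_crit * slope_se
slope_ci
```

On the fitted object, `confint(height_weight.lm)` returns the same kind of interval directly.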

The p-values for both the intercept and the Height coefficient are <2e-16 (very small). If the true coefficients were zero, t values this extreme would be very unlikely, so there is strong evidence that both terms are relevant in the model.

The multiple R-squared is 0.8552, which means that the model explains 85.52% of the variability in Weight.
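
In a one-predictor model the reported statistics are tied together, which gives a cheap consistency check. The sketch below uses only values printed by `summary()`: R-squared can be recovered from the F-statistic and the residual degrees of freedom, and the F-statistic is the square of the slope's t value. Small discrepancies are expected because the printed values are rounded; the variable names are illustrative.

```r
f_stat   <- 5.904e4   # F-statistic from the summary output
df_resid <- 9998      # residual degrees of freedom
t_slope  <- 243.0     # t value of the Height coefficient

# With one predictor: R^2 = F / (F + residual df)
r_squared <- f_stat / (f_stat + df_resid)
round(r_squared, 4)
# [1] 0.8552

# With one predictor: F = t^2 (compare with the reported 5.904e+04)
t_slope^2
# [1] 59049
```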

Plot of Residuals

From the book:

“A model that fits the data well would tend to over-predict as often as it under-predicts. Thus, if we plot the residual values, we would expect to see them distributed uniformly around zero for a well-fitted model.”

The plot below shows the residuals scattered roughly evenly above and below zero.

plot(fitted(height_weight.lm), resid(height_weight.lm))
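
A horizontal reference line at zero makes the balance between over- and under-prediction easier to judge. The sketch below is runnable on its own because it uses R's built-in `women` data set (15 height and weight observations) as a stand-in for the CSV. That substitution is an assumption made purely so the example is self-contained, and the names `fit` and `res` are illustrative.

```r
# Stand-in data: built-in `women` data set (columns height, weight)
fit <- lm(weight ~ height, data = women)
res <- resid(fit)

# Residuals vs fitted values with a dashed reference line at zero
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Count negative and positive residuals; a balanced fit gives similar counts
table(sign(res))

# Least-squares residuals sum to (numerically) zero when an intercept is fitted
sum(res)
```

For the model in this analysis, the same two plotting lines apply with `height_weight.lm` in place of `fit`.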

Normal Q-Q Plot

The Q-Q plot suggests that the residuals are approximately normally distributed.

qqnorm(resid(height_weight.lm))
qqline(resid(height_weight.lm))
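
For readers unfamiliar with how a normal Q-Q plot is built, the sketch below reproduces its points by hand: the sorted residuals are paired with standard normal quantiles at the probabilities returned by `ppoints()`. It uses the built-in `women` data set as a stand-in for the CSV (an assumption, made so the example is self-contained), and the variable names are illustrative.

```r
# Stand-in model on the built-in `women` data set
fit <- lm(weight ~ height, data = women)
res <- resid(fit)
n   <- length(res)

# Theoretical quantiles: standard normal quantiles at evenly spaced probabilities
theoretical <- qnorm(ppoints(n))
# Sample quantiles: the residuals in increasing order
sample_q <- sort(res)

# Same picture as qqnorm(res)
plot(theoretical, sample_q,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# Check against the coordinates qqnorm() computes
qq <- qqnorm(res, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
```

If the residuals are close to normal, the plotted points fall near a straight line, which is what `qqline()` draws for comparison.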