hwData <- read.csv(file="D:/data605/hw11/weight-height.csv", header=TRUE, sep=",")
head(hwData)
## Gender Height Weight
## 1 Male 73.84702 241.8936
## 2 Male 68.78190 162.3105
## 3 Male 74.11011 212.7409
## 4 Male 71.73098 220.0425
## 5 Male 69.88180 206.3498
## 6 Male 67.25302 152.2122
dim(hwData)
## [1] 10000 3
There are 10,000 observations of three variables. Height is the explanatory variable and Weight is the response variable.
summary(hwData)
## Gender Height Weight
## Length:10000 Min. :54.26 Min. : 64.7
## Class :character 1st Qu.:63.51 1st Qu.:135.8
## Mode :character Median :66.32 Median :161.2
## Mean :66.37 Mean :161.4
## 3rd Qu.:69.17 3rd Qu.:187.2
## Max. :79.00 Max. :270.0
plot(hwData$Height, hwData$Weight, xlab = "Height", ylab = "Weight")
The scatter plot above suggests a roughly linear, positive relationship between height and weight.
hw_lm <- lm(Weight ~ Height, data = hwData)
summary(hw_lm)
##
## Call:
## lm(formula = Weight ~ Height, data = hwData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.934 -8.236 -0.119 8.260 46.844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -350.73719 2.11149 -166.1 <2e-16 ***
## Height 7.71729 0.03176 243.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.22 on 9998 degrees of freedom
## Multiple R-squared: 0.8552, Adjusted R-squared: 0.8552
## F-statistic: 5.904e+04 on 1 and 9998 DF, p-value: < 2.2e-16
plot(hwData$Height, hwData$Weight, xlab = "Height", ylab = "Weight")
abline(hw_lm)
From the summary, the y-intercept is -350.73719 and the slope (the Height coefficient) is 7.71729, so the fitted linear equation for this model is
body_weight = -350.73719 + 7.71729 * body_height
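As a quick illustration of how the fitted equation can be used, the sketch below plugs a few heights into it. The coefficients are copied from the summary output; the function name `predict_weight` and the example heights are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(hw_lm) output.
intercept <- -350.73719
slope <- 7.71729

# Hypothetical helper: predicted weight for a given height,
# in the same units as the data set.
predict_weight <- function(height) {
  intercept + slope * height
}

# Example heights inside the observed range (54.26 to 79.00).
predict_weight(c(60, 66.37, 70))
```

At the mean height of 66.37 the prediction lands close to the mean weight of 161.4, as expected for a least-squares line. The equation should only be trusted inside the observed height range; the negative intercept is an extrapolation to a height of zero and has no physical meaning. With the model object available, `predict(hw_lm, newdata = data.frame(Height = 70))` should give the same kind of result.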
For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient.
7.71729/0.03176
## [1] 242.9877
The standard error for Height is about 243 times smaller than the coefficient (this ratio, 242.9877, is the t value of 243.0 reported in the summary), which means there is relatively little variability in the slope estimate.
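The ratio computed by hand is the same quantity that `summary()` reports in the `t value` column. The sketch below checks this on simulated data, since the original CSV file is not assumed to be available; the simulated `height` and `weight` values and the object name `fit` are assumptions for illustration only.

```r
# Simulated stand-in for the height/weight data (assumption, not the real file).
set.seed(1)
height <- rnorm(200, mean = 66, sd = 4)
weight <- -350 + 7.7 * height + rnorm(200, sd = 12)

fit <- lm(weight ~ height)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coefs <- coef(summary(fit))

# Estimate divided by standard error, for the intercept and the slope.
ratio <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Side-by-side comparison with the reported t values.
cbind(ratio = ratio, t_value = coefs[, "t value"])
```

The two columns should agree, so the "five to ten times smaller" rule of thumb amounts to asking for a t value of at least 5 to 10 in absolute value.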
Pr(>|t|)
This column gives the p-value of each coefficient: the probability of observing a t value at least this extreme if the true coefficient were zero. A small p-value is evidence that the coefficient is relevant in the model.
In this example, the p-value for Height is below 2e-16, so it is very unlikely that the observed relationship between height and weight is due to chance alone.
The p-value for the intercept is also below 2e-16, which means the intercept is likewise significantly different from zero.
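To make the Pr(>|t|) column concrete, the sketch below recomputes a two-sided p-value from a t value and the residual degrees of freedom, using the numbers reported in the summary. The variable names are hypothetical, and the comparison against `.Machine$double.eps` reflects my understanding of why R prints `<2e-16` rather than an exact value.

```r
# Values copied from the summary output.
t_height <- 243.0   # t value for Height
df_resid <- 9998    # residual degrees of freedom

# Two-sided p-value: probability of a t statistic at least this extreme
# under the null hypothesis that the true coefficient is zero.
p_height <- 2 * pt(abs(t_height), df = df_resid, lower.tail = FALSE)
p_height

# The value is smaller than machine precision, so it is reported as a bound.
p_height < .Machine$double.eps

# For contrast, a moderate t value of 2 gives a p-value just under 0.05.
p_moderate <- 2 * pt(2.0, df = df_resid, lower.tail = FALSE)
p_moderate
```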
The reported Multiple R-squared of 0.8552 means that the model explains 85.52% of the variation in weight.
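R-squared can be reproduced by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below shows this on simulated data (an assumption, since the original file is not bundled here); it also checks that, for a regression with a single predictor, R-squared equals the squared correlation between the two variables.

```r
# Simulated stand-in for the height/weight data (assumption, not the real file).
set.seed(2)
height <- rnorm(200, mean = 66, sd = 4)
weight <- -350 + 7.7 * height + rnorm(200, sd = 12)

fit <- lm(weight ~ height)

# Variation left unexplained by the model.
ss_res <- sum(resid(fit)^2)
# Total variation of the response around its mean.
ss_tot <- sum((weight - mean(weight))^2)

r2_manual <- 1 - ss_res / ss_tot
r2_cor <- cor(height, weight)^2

c(manual = r2_manual,
  from_summary = summary(fit)$r.squared,
  cor_squared = r2_cor)
```

By the same identity, an R-squared of 0.8552 corresponds to a correlation of roughly 0.92 between height and weight.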
A residual is the difference between the actual measured value stored in the data frame and the value that the fitted regression line predicts for that data point. Residual analysis examines these residuals to see what they can tell us about the model’s quality.
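The definition can be checked directly: subtracting the fitted values from the observed values should reproduce what `resid()` returns. The sketch below does this on simulated data; the data and object names are assumptions for illustration.

```r
# Simulated stand-in for the height/weight data (assumption, not the real file).
set.seed(3)
height <- rnorm(200, mean = 66, sd = 4)
weight <- -350 + 7.7 * height + rnorm(200, sd = 12)

fit <- lm(weight ~ height)

# Residual = observed value minus the value predicted by the fitted line.
manual_resid <- weight - fitted(fit)

# Should be TRUE: the manual residuals match resid().
all.equal(unname(manual_resid), unname(resid(fit)))

# With an intercept in the model, least-squares residuals sum to
# (numerically) zero.
sum(resid(fit))
```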
plot(fitted(hw_lm),resid(hw_lm))
For a well-fitted model, we would expect the residuals to be scattered evenly around zero, with no visible pattern across the fitted values.
That is mostly what we see here.
Another test of the residuals uses the quantile-versus-quantile, or Q-Q plot.
If the model fits the data well, we would expect the residuals to be normally (Gaussian) distributed around a mean of zero.
qqnorm(resid(hw_lm))
qqline(resid(hw_lm))
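For readers curious about what the Q-Q plot actually draws, the sketch below recovers its coordinates without plotting. It uses simulated normal values as a stand-in for `resid(hw_lm)` (an assumption), and pairs the sorted sample values with the theoretical normal quantiles that `qqnorm()` computes.

```r
# Simulated stand-in for the model residuals (assumption).
set.seed(4)
res <- rnorm(100, mean = 0, sd = 12)

# Coordinates of the Q-Q plot, without drawing it.
qq <- qqnorm(res, plot.it = FALSE)

# Theoretical quantiles: standard normal quantiles at evenly spaced
# probability points.
theoretical <- qnorm(ppoints(length(res)))

# x-axis: theoretical quantiles; y-axis: the sample values themselves.
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), sort(res))
```

If the residuals are close to normally distributed, the points fall near the straight line that `qqline()` adds; systematic bending at the ends indicates heavier or lighter tails than a normal distribution.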
par(mfrow = c(2,2)) # arrange the four diagnostic plots in a 2 x 2 grid
plot(hw_lm)
From the analysis, the residuals are scattered evenly around zero with hardly any outliers, and the Multiple R-squared is 0.8552. So I think a linear model is a good fit for this data set.