Data 605 Week11-Discussion

Data

The data set contains the height and weight of 10000 people.

Link to the Dataset

Data Analysis

hwData <- read.csv(file="D:/data605/hw11/weight-height.csv", header=TRUE, sep=",")
head(hwData)

##   Gender   Height   Weight
## 1   Male 73.84702 241.8936
## 2   Male 68.78190 162.3105
## 3   Male 74.11011 212.7409
## 4   Male 71.73098 220.0425
## 5   Male 69.88180 206.3498
## 6   Male 67.25302 152.2122

dim(hwData)

## [1] 10000     3

The data set has 10000 observations. Height is the explanatory variable and weight is the response variable.

summary(hwData)

##     Gender              Height          Weight     
##  Length:10000       Min.   :54.26   Min.   : 64.7  
##  Class :character   1st Qu.:63.51   1st Qu.:135.8  
##  Mode  :character   Median :66.32   Median :161.2  
##                     Mean   :66.37   Mean   :161.4  
##                     3rd Qu.:69.17   3rd Qu.:187.2  
##                     Max.   :79.00   Max.   :270.0

plot(hwData$Height, hwData$Weight, xlab = "Height", ylab = "Weight")

The scatter plot above suggests a roughly linear relationship between height and weight.

Linear Model

hw_lm <- lm(Weight ~ Height, data = hwData)
summary(hw_lm)

## 
## Call:
## lm(formula = Weight ~ Height, data = hwData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.934  -8.236  -0.119   8.260  46.844 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -350.73719    2.11149  -166.1   <2e-16 ***
## Height         7.71729    0.03176   243.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.22 on 9998 degrees of freedom
## Multiple R-squared:  0.8552, Adjusted R-squared:  0.8552 
## F-statistic: 5.904e+04 on 1 and 9998 DF,  p-value: < 2.2e-16

plot(hwData$Height, hwData$Weight, xlab = "Height", ylab = "Weight")
abline(hw_lm)

Analysis of the linear model

From the summary, the y-intercept is -350.73719 and the slope (the coefficient of Height) is 7.71729, so the fitted linear equation for this model is

body_weight = -350.73719 + 7.71729 * body_height
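As a rough illustration, the fitted equation can be used to predict a weight from a height. The sketch below hard-codes the two coefficients from the summary output; `predict_weight()` is a hypothetical helper name, not something from the original analysis.

```r
# Coefficients copied from the summary(hw_lm) output
intercept <- -350.73719
slope <- 7.71729

# predict_weight() is a hypothetical helper that applies the fitted equation
predict_weight <- function(height) {
  intercept + slope * height
}

predict_weight(70)                  # about 189.47
predict_weight(c(60, 65, 70, 75))   # vectorised over several heights
```

With the fitted object available, `predict(hw_lm, newdata = data.frame(Height = 70))` should give the same kind of prediction. The equation is only meaningful inside the observed height range (54.26 to 79.00 in this data set); the negative intercept shows that it cannot be extrapolated down to very small heights.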

Evaluating the quality of the model

For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient.

7.71729/0.03176

## [1] 242.9877

The standard error for height is about 243 times smaller than the coefficient, which means there is relatively little variability in the slope estimate. This ratio is the same as the t value (243.0) reported for Height in the summary.
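One way to make "little variability" concrete is an approximate 95% confidence interval for the slope. This is a minimal sketch that uses only the estimate, standard error, and degrees of freedom reported in the summary; the variable names are illustrative.

```r
# Values copied from the summary(hw_lm) output
estimate  <- 7.71729   # slope for Height
std_error <- 0.03176   # its standard error
df_resid  <- 9998      # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error
ci   # roughly 7.655 to 7.780
```

With the fitted object available, `confint(hw_lm)` computes intervals of this kind directly. The interval is narrow relative to the slope itself, which is what the large ratio above indicates.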

Pr(>|t|)

It shows the p-value (significance) of the coefficient: the probability of seeing a t value at least this extreme if the true coefficient were zero. A very small p-value is strong evidence that the coefficient is relevant in the model.

In this example, the p-value for height is reported as less than 2e-16, which is extremely small, so height is clearly relevant in this model.

The Pr(>|t|) for the intercept is also less than 2e-16, so the intercept is clearly different from zero as well.
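The `<2e-16` in the summary is a display floor rather than an exact value: R prints it whenever a p-value falls below machine epsilon. The sketch below recomputes the two-sided p-value for Height from the reported t value and degrees of freedom; the variable names are illustrative.

```r
# Values copied from the summary(hw_lm) output
t_slope  <- 243.0   # t value for Height
df_resid <- 9998    # residual degrees of freedom

# Two-sided p-value: probability of |t| at least this large under a zero slope
p_slope <- 2 * pt(abs(t_slope), df = df_resid, lower.tail = FALSE)

.Machine$double.eps             # about 2.2e-16, the printing threshold
p_slope < .Machine$double.eps   # TRUE
```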

The reported Multiple R-squared of 0.8552 means that the model explains 85.52% of the variation in weight.
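To show where that number comes from, here is a small sketch that computes R-squared by hand as one minus the ratio of residual to total variation. It runs on simulated stand-in data (an assumption, not the original weight-height.csv), so the value it prints will not be 0.8552.

```r
# Simulated stand-in data, NOT the original data set
set.seed(1)
height <- rnorm(200, mean = 66, sd = 4)
weight <- -350 + 7.7 * height + rnorm(200, sd = 12)
fit <- lm(weight ~ height)

ss_res <- sum(resid(fit)^2)                # variation left unexplained by the line
ss_tot <- sum((weight - mean(weight))^2)   # total variation in the response
r2_manual <- 1 - ss_res / ss_tot

r2_manual
summary(fit)$r.squared   # matches r2_manual
cor(height, weight)^2    # also matches when there is a single predictor
```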

Residual Analysis

A residual is the difference between the actual measured value stored in the data frame and the value that the fitted regression line predicts for that data point. Residual analysis examines these residuals to see what they can tell us about the model’s quality.

plot(fitted(hw_lm),resid(hw_lm))

We could see that the residuals are scattered fairly evenly around zero, with no obvious pattern.

For a well-fitted model, we would expect exactly this: residuals spread evenly around zero across the whole range of fitted values.
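For contrast, it may help to see what a poorly fitted model looks like. The sketch below fits a straight line to simulated curved data (an assumption made up for illustration, unrelated to the height and weight data) and adds a zero line and a smoother to the residual plot.

```r
# Simulated curved data, made up for illustration only
set.seed(42)
x <- runif(300, min = 0, max = 10)
y_curved <- 2 + 0.5 * x^2 + rnorm(300, sd = 2)

# Deliberately fit a straight line to data that is not linear
bad_fit <- lm(y_curved ~ x)

plot(fitted(bad_fit), resid(bad_fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                                        # zero reference line
lines(lowess(fitted(bad_fit), resid(bad_fit)), col = "red")   # trend in the residuals
```

Here the smoother bends away from the zero line in a U shape, which signals that a straight line misses structure in the data. In a well-fitted model the smoother stays close to zero.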

Q-Q plot

Another test of the residuals uses the quantile-versus-quantile, or Q-Q plot.

If the model fits the data well, we would expect the residuals to be normally (Gaussian) distributed around a mean of zero.

qqnorm(resid(hw_lm))
qqline(resid(hw_lm))

par(mfrow = c(2,2)) # arrange the four diagnostic plots in a 2x2 grid
plot(hw_lm)

Summary

From the analysis, we could see that the residuals are spread evenly around zero with hardly any outliers, and the Multiple R-squared is 0.8552. So I think a linear model is a good fit for this data set.