Discussion 11
Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Data set:
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
p_wh <- read.csv("weight-height.csv", header = TRUE, sep = ",")
# Keep only the first 100 rows (not a random sample; all are Male, as the listing shows)
wh <- head(p_wh, 100)
wh
## Gender Height Weight
## 1 Male 73.84702 241.8936
## 2 Male 68.78190 162.3105
## 3 Male 74.11011 212.7409
## 4 Male 71.73098 220.0425
## 5 Male 69.88180 206.3498
## 6 Male 67.25302 152.2122
## 7 Male 68.78508 183.9279
## 8 Male 68.34852 167.9711
## 9 Male 67.01895 175.9294
## 10 Male 63.45649 156.3997
## 11 Male 71.19538 186.6049
## 12 Male 71.64081 213.7412
## 13 Male 64.76633 167.1275
## 14 Male 69.28307 189.4462
## 15 Male 69.24373 186.4342
## 16 Male 67.64562 172.1869
## 17 Male 72.41832 196.0285
## 18 Male 63.97433 172.8835
## 19 Male 69.64006 185.9840
## 20 Male 67.93600 182.4266
## 21 Male 67.91505 174.1159
## 22 Male 69.43944 197.7314
## 23 Male 66.14913 149.1736
## 24 Male 75.20597 228.7618
## 25 Male 67.89320 162.0067
## 26 Male 68.14403 192.3440
## 27 Male 69.08963 184.4352
## 28 Male 72.80084 206.8282
## 29 Male 67.42124 175.2139
## 30 Male 68.49642 154.3426
## 31 Male 68.61811 187.5068
## 32 Male 74.03381 212.9102
## 33 Male 71.52822 195.0322
## 34 Male 69.18016 205.1836
## 35 Male 69.57720 204.1641
## 36 Male 70.40093 192.9035
## 37 Male 69.07617 197.4882
## 38 Male 67.19352 183.8110
## 39 Male 65.80732 163.8518
## 40 Male 64.30419 163.1080
## 41 Male 67.97434 172.1356
## 42 Male 72.18943 194.0454
## 43 Male 65.27035 168.6177
## 44 Male 66.09018 161.1934
## 45 Male 67.51032 164.6603
## 46 Male 70.10479 188.9223
## 47 Male 68.25184 187.0606
## 48 Male 72.17271 209.0709
## 49 Male 69.17986 192.0143
## 50 Male 72.87036 211.3425
## 51 Male 64.78258 165.6116
## 52 Male 70.18355 201.0719
## 53 Male 68.49145 173.4240
## 54 Male 67.33083 181.4077
## 55 Male 66.99094 169.7377
## 56 Male 66.49955 163.3095
## 57 Male 68.35306 189.7102
## 58 Male 70.77446 192.1248
## 59 Male 71.21592 198.1985
## 60 Male 70.01337 209.5265
## 61 Male 71.40318 198.7598
## 62 Male 69.55201 198.0795
## 63 Male 73.81853 195.2906
## 64 Male 66.99688 164.9433
## 65 Male 71.41847 179.8639
## 66 Male 65.27930 155.2504
## 67 Male 68.27419 184.5194
## 68 Male 72.76537 220.6780
## 69 Male 68.09938 183.3127
## 70 Male 68.89671 196.4513
## 71 Male 69.28951 184.5956
## 72 Male 70.52322 207.5328
## 73 Male 69.66373 177.2009
## 74 Male 67.59527 163.1080
## 75 Male 72.50812 216.2182
## 76 Male 71.25299 204.6555
## 77 Male 71.80919 200.9206
## 78 Male 72.24517 220.9018
## 79 Male 66.51263 196.4499
## 80 Male 66.02903 168.6408
## 81 Male 67.57715 181.4327
## 82 Male 68.24657 198.6587
## 83 Male 73.82613 237.9167
## 84 Male 69.80246 173.0413
## 85 Male 65.95958 160.6839
## 86 Male 71.07902 188.6029
## 87 Male 66.59620 208.3457
## 88 Male 68.95154 193.4351
## 89 Male 68.24446 174.1097
## 90 Male 72.31683 197.3686
## 91 Male 71.81542 201.6207
## 92 Male 65.23705 181.0120
## 93 Male 70.64053 182.1225
## 94 Male 64.73193 177.5493
## 95 Male 67.10355 164.9746
## 96 Male 65.11748 165.7171
## 97 Male 71.70123 193.0942
## 98 Male 66.83288 180.6839
## 99 Male 66.47128 172.7737
## 100 Male 69.41153 177.4706
Regression model:
# Height is the predictor (x-axis) and Weight is the response (y-axis)
plot(wh$Height, wh$Weight, xlab = "Height", ylab = "Weight", main = "Height vs Weight")

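x <- wh$Height  # predictor
y <- wh$Weight  # response
# Linear model; note that this reuses the name wh, replacing the data frame with the fitted model
wh <- lm(y ~ x)
# Scatter plot with the fitted line (ggplot() is used because qplot() is deprecated in recent ggplot2)
ggplot(data.frame(x, y), aes(x, y)) + geom_point() +
  geom_abline(intercept = coef(wh)[1], slope = coef(wh)[2]) +
  labs(x = "Height", y = "Weight", title = "Height vs Weight")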

Residual analysis:
# Residuals against fitted values; wh is the fitted lm object here
ggplot(wh, aes(.fitted, .resid)) +
  geom_point(color = "red", size = 2) +
  labs(title = "Fitted Values vs Residuals", x = "Fitted Values", y = "Residuals")

qqnorm(resid(wh))
qqline(resid(wh))

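The fitted-values-versus-residuals plot above shows the residuals scattered roughly evenly above and below zero, with no obvious pattern. That is what we expect when a straight line captures the relationship and the error variance is roughly constant.
The Q-Q plot suggests that the residuals are approximately normally distributed.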
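The visual checks above can be backed up with numeric ones. The following is a minimal sketch, not part of the original analysis: it uses synthetic stand-in data (the values, the seed, and the names height, weight, and fit are all assumptions, since weight-height.csv is not bundled here), then runs a Shapiro-Wilk test on the residuals and a rough rank-correlation check for changing residual spread.

```r
# Synthetic stand-in data (assumption: NOT the weight-height.csv values)
set.seed(42)
height <- rnorm(100, mean = 69, sd = 3)
weight <- -217 + 5.85 * height + rnorm(100, mean = 0, sd = 12)
fit <- lm(weight ~ height)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a large p-value gives no evidence against normality
sw <- shapiro.test(resid(fit))
print(sw)

# Rough check for non-constant variance: does the size of the residuals
# trend with the fitted values? (Spearman rank correlation)
spread <- cor.test(abs(resid(fit)), fitted(fit), method = "spearman")
print(spread)
```

To apply the same checks to the model in this post, replace fit with the fitted lm object. These tests complement the plots; they do not replace them.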
Was the linear model appropriate? Why or why not?
summary(wh)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.261 -8.457 -0.921 7.275 35.861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -217.2013 31.8791 -6.813 7.78e-10 ***
## x 5.8515 0.4614 12.683 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.94 on 98 degrees of freedom
## Multiple R-squared: 0.6214, Adjusted R-squared: 0.6176
## F-statistic: 160.9 on 1 and 98 DF, p-value: < 2.2e-16
R-squared: 0.6214; adjusted R-squared: 0.6176. This means that height explains about 62% of the variation in weight in this sample.
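To make the "share of variation explained" reading concrete, here is a small sketch of how R-squared is computed. It uses synthetic stand-in data (an assumption; the values and the names height, weight, and fit are illustrative, not the post's data). With a single predictor, R-squared also equals the squared correlation between the predictor and the response.

```r
# Synthetic stand-in data (assumption: NOT the weight-height.csv values)
set.seed(42)
height <- rnorm(100, mean = 69, sd = 3)
weight <- -217 + 5.85 * height + rnorm(100, mean = 0, sd = 12)
fit <- lm(weight ~ height)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(resid(fit)^2)
ss_tot <- sum((weight - mean(weight))^2)
r2_manual <- 1 - ss_res / ss_tot

# The same quantity as reported by summary(), and as the squared correlation
r2_summary <- summary(fit)$r.squared
r2_cor <- cor(height, weight)^2

print(c(manual = r2_manual, summary = r2_summary, cor_squared = r2_cor))
```

All three numbers agree up to floating-point rounding, which is why an R-squared of 0.6214 corresponds to a correlation of about 0.79 between height and weight.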
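Equation of the linear model: weight = -217.2013 + 5.8515 * height. Each additional unit of height is associated with about 5.85 more units of weight, on average.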
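As a quick worked example, the fitted coefficients from the summary output can be used to predict weight for a given height. The helper predict_weight is a hypothetical name introduced here for illustration, and the height of 70 is an arbitrary value inside the sample's observed range (roughly 63 to 75).

```r
# Coefficients copied from the summary(wh) output above
intercept <- -217.2013
slope <- 5.8515

# Hypothetical helper: predicted weight for a given height
predict_weight <- function(height) {
  intercept + slope * height
}

predicted <- predict_weight(70)
print(predicted)  # 192.4037
```

Predictions for heights far outside the observed range are extrapolations, and the negative intercept shows that the line is not meaningful near a height of zero.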
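Hence we can say that the linear model is appropriate for these data: the residuals show no clear pattern and are approximately normal, and the slope is highly significant (p < 2e-16). Because the sample is the first 100 rows of the file, all of which are male, this conclusion applies to that subset only.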