Discussion 11

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Data set:

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
p_wh <- read.csv("weight-height.csv", header = TRUE, sep = ",")
wh <- head(p_wh,100)
wh
##     Gender   Height   Weight
## 1     Male 73.84702 241.8936
## 2     Male 68.78190 162.3105
## 3     Male 74.11011 212.7409
## 4     Male 71.73098 220.0425
## 5     Male 69.88180 206.3498
## 6     Male 67.25302 152.2122
## 7     Male 68.78508 183.9279
## 8     Male 68.34852 167.9711
## 9     Male 67.01895 175.9294
## 10    Male 63.45649 156.3997
## 11    Male 71.19538 186.6049
## 12    Male 71.64081 213.7412
## 13    Male 64.76633 167.1275
## 14    Male 69.28307 189.4462
## 15    Male 69.24373 186.4342
## 16    Male 67.64562 172.1869
## 17    Male 72.41832 196.0285
## 18    Male 63.97433 172.8835
## 19    Male 69.64006 185.9840
## 20    Male 67.93600 182.4266
## 21    Male 67.91505 174.1159
## 22    Male 69.43944 197.7314
## 23    Male 66.14913 149.1736
## 24    Male 75.20597 228.7618
## 25    Male 67.89320 162.0067
## 26    Male 68.14403 192.3440
## 27    Male 69.08963 184.4352
## 28    Male 72.80084 206.8282
## 29    Male 67.42124 175.2139
## 30    Male 68.49642 154.3426
## 31    Male 68.61811 187.5068
## 32    Male 74.03381 212.9102
## 33    Male 71.52822 195.0322
## 34    Male 69.18016 205.1836
## 35    Male 69.57720 204.1641
## 36    Male 70.40093 192.9035
## 37    Male 69.07617 197.4882
## 38    Male 67.19352 183.8110
## 39    Male 65.80732 163.8518
## 40    Male 64.30419 163.1080
## 41    Male 67.97434 172.1356
## 42    Male 72.18943 194.0454
## 43    Male 65.27035 168.6177
## 44    Male 66.09018 161.1934
## 45    Male 67.51032 164.6603
## 46    Male 70.10479 188.9223
## 47    Male 68.25184 187.0606
## 48    Male 72.17271 209.0709
## 49    Male 69.17986 192.0143
## 50    Male 72.87036 211.3425
## 51    Male 64.78258 165.6116
## 52    Male 70.18355 201.0719
## 53    Male 68.49145 173.4240
## 54    Male 67.33083 181.4077
## 55    Male 66.99094 169.7377
## 56    Male 66.49955 163.3095
## 57    Male 68.35306 189.7102
## 58    Male 70.77446 192.1248
## 59    Male 71.21592 198.1985
## 60    Male 70.01337 209.5265
## 61    Male 71.40318 198.7598
## 62    Male 69.55201 198.0795
## 63    Male 73.81853 195.2906
## 64    Male 66.99688 164.9433
## 65    Male 71.41847 179.8639
## 66    Male 65.27930 155.2504
## 67    Male 68.27419 184.5194
## 68    Male 72.76537 220.6780
## 69    Male 68.09938 183.3127
## 70    Male 68.89671 196.4513
## 71    Male 69.28951 184.5956
## 72    Male 70.52322 207.5328
## 73    Male 69.66373 177.2009
## 74    Male 67.59527 163.1080
## 75    Male 72.50812 216.2182
## 76    Male 71.25299 204.6555
## 77    Male 71.80919 200.9206
## 78    Male 72.24517 220.9018
## 79    Male 66.51263 196.4499
## 80    Male 66.02903 168.6408
## 81    Male 67.57715 181.4327
## 82    Male 68.24657 198.6587
## 83    Male 73.82613 237.9167
## 84    Male 69.80246 173.0413
## 85    Male 65.95958 160.6839
## 86    Male 71.07902 188.6029
## 87    Male 66.59620 208.3457
## 88    Male 68.95154 193.4351
## 89    Male 68.24446 174.1097
## 90    Male 72.31683 197.3686
## 91    Male 71.81542 201.6207
## 92    Male 65.23705 181.0120
## 93    Male 70.64053 182.1225
## 94    Male 64.73193 177.5493
## 95    Male 67.10355 164.9746
## 96    Male 65.11748 165.7171
## 97    Male 71.70123 193.0942
## 98    Male 66.83288 180.6839
## 99    Male 66.47128 172.7737
## 100   Male 69.41153 177.4706

Regression model:

plot(wh$Height, wh$Weight, xlab='height', ylab='weight',  main='height vs weight')   # height on x, weight on y, matching the labels

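x <- wh$Height   # predictor
y <- wh$Weight   # response
wh <- lm(y ~ x)  # linear model; note that this overwrites the data frame wh
# scatterplot with the fitted line (qplot() is deprecated in recent ggplot2)
ggplot(data.frame(x, y), aes(x, y)) + geom_point() +
  geom_abline(intercept = coef(wh)[1], slope = coef(wh)[2]) +
  labs(x = "height", y = "weight", title = "height vs weight")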
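Reusing the name `wh` for the fitted model replaces the data frame, which makes later steps harder to follow. A minimal sketch of one alternative, assuming the column names `Height` and `Weight` shown in the printed data; the names `wh_small` and `fit` are hypothetical. It uses only the first ten rows copied from the printout so that it runs without the CSV file, so its coefficients will differ from the 100-row fit.

```r
# First ten rows copied from the printed data above (illustration only).
wh_small <- data.frame(
  Height = c(73.84702, 68.78190, 74.11011, 71.73098, 69.88180,
             67.25302, 68.78508, 68.34852, 67.01895, 63.45649),
  Weight = c(241.8936, 162.3105, 212.7409, 220.0425, 206.3498,
             152.2122, 183.9279, 167.9711, 175.9294, 156.3997)
)

# Fit with a formula on the data frame; `fit` is a hypothetical name.
# The coefficient is labelled "Height" instead of "x".
fit <- lm(Weight ~ Height, data = wh_small)
coef(fit)

# The data frame is still available after fitting.
nrow(wh_small)
```

Keeping the data and the model under separate names also lets `predict()` accept new data with a `Height` column directly.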

Residual analysis:

ggplot(wh, aes(.fitted, .resid)) + geom_point(color = "red", size = 2) +
  labs(title = "Fitted Values vs Residuals", x = "Fitted Values", y = "Residuals")

qqnorm(resid(wh))
qqline(resid(wh))

The fitted values vs residuals plot above shows the residuals scattered roughly evenly above and below zero, with no obvious pattern.

The Q-Q plot suggests that the residuals are approximately normally distributed.
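The visual checks can be backed up with a few numbers. The sketch below is one possible way to do that, not part of the original analysis: it refits the model on the first ten rows copied from the printout (so it runs without the CSV file) and uses the hypothetical names `wh_small`, `fit`, `res` and `sw`. With so few rows the test has little power, so the same calls would be more informative on the full 100-row model.

```r
# First ten rows copied from the printed data above (illustration only).
wh_small <- data.frame(
  Height = c(73.84702, 68.78190, 74.11011, 71.73098, 69.88180,
             67.25302, 68.78508, 68.34852, 67.01895, 63.45649),
  Weight = c(241.8936, 162.3105, 212.7409, 220.0425, 206.3498,
             152.2122, 183.9279, 167.9711, 175.9294, 156.3997)
)
fit <- lm(Weight ~ Height, data = wh_small)
res <- resid(fit)

# Least-squares residuals from a model with an intercept sum to zero
# (up to floating-point rounding).
sum(res)

# Shapiro-Wilk test: a large p-value gives no evidence against normality.
sw <- shapiro.test(res)
sw$p.value

# Rough check for changing spread: correlation between the absolute
# residuals and the fitted values (values near zero suggest constant spread).
cor(abs(res), fitted(fit))
```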

Was the linear model appropriate? Why or why not?

summary(wh)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.261  -8.457  -0.921   7.275  35.861 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -217.2013    31.8791  -6.813 7.78e-10 ***
## x              5.8515     0.4614  12.683  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.94 on 98 degrees of freedom
## Multiple R-squared:  0.6214, Adjusted R-squared:  0.6176 
## F-statistic: 160.9 on 1 and 98 DF,  p-value: < 2.2e-16

R-squared: 0.6214; adjusted R-squared: 0.6176. This means that height explains about 62% of the variation in weight.
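To make the meaning of R-squared concrete, it can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes the first ten rows copied from the printout (so the value will not equal 0.6214) and uses the hypothetical names `wh_small`, `fit`, `ss_res`, `ss_tot` and `r2`.

```r
# First ten rows copied from the printed data above (illustration only).
wh_small <- data.frame(
  Height = c(73.84702, 68.78190, 74.11011, 71.73098, 69.88180,
             67.25302, 68.78508, 68.34852, 67.01895, 63.45649),
  Weight = c(241.8936, 162.3105, 212.7409, 220.0425, 206.3498,
             152.2122, 183.9279, 167.9711, 175.9294, 156.3997)
)
fit <- lm(Weight ~ Height, data = wh_small)

# Variation left unexplained by the model.
ss_res <- sum(resid(fit)^2)
# Total variation of the response around its mean.
ss_tot <- sum((wh_small$Weight - mean(wh_small$Weight))^2)

# Share of the variation in weight explained by height.
r2 <- 1 - ss_res / ss_tot

# Agrees with the value reported by summary(), and for a single predictor
# also with the squared correlation between height and weight.
all.equal(r2, summary(fit)$r.squared)
all.equal(r2, cor(wh_small$Height, wh_small$Weight)^2)
```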

Equation of the linear model: weight = -217.2013 + 5.8515(height). Each additional unit of height is associated with an increase of about 5.85 units of weight.
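As a worked example, the fitted equation can be wrapped in a small function. This is only a sketch using the coefficients reported by `summary()`; the names `b0`, `b1` and `predict_weight` are hypothetical. The heights in the data run from about 63.5 to 75.2, so predictions outside that range are extrapolations, and the negative intercept has no physical meaning at height 0.

```r
# Coefficients copied from the summary output above.
b0 <- -217.2013   # intercept
b1 <- 5.8515      # slope: change in weight per unit of height

# Hypothetical helper: predicted weight for a given height.
predict_weight <- function(height) b0 + b1 * height

predict_weight(70)   # -217.2013 + 5.8515 * 70 = 192.4037
```

With the model object itself, `predict()` with a `newdata` data frame would give the same result without copying the coefficients by hand.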

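Hence we can say that the relationship between height and weight is approximately linear, and the linear model is appropriate for this data: the residuals are scattered around zero without a pattern, they are approximately normally distributed, and the slope is highly significant (p < 2e-16).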