What I learned from the RegressionIntro:

I learned that it is really helpful to graph data in order to analyze it. In my past stats classes I would mostly just look at the numbers to try to determine a pattern or a trend. Sometimes we would graph the points and look at the actual data. What was really helpful here was graphing the residuals. This gives a nice idea of how well the approximation works because it tells us how close the predicted values were to the real values. I really enjoy using RStudio and I think it works really well for statistical analysis.

Load and take a look at the data

data(women)
#head(women)
#names(women)
attach(women)

Scatterplot

plot(weight, height, ylab = "Height in inches", xlab = "Weight in Pounds", main = "Height and Weight of women")

Trends: There is definitely a trend. As the height of the women increases, the weight of the women increases. There is most certainly a strong, positive correlation. I would guess it is close to positive .8 or .9. The data is packed tightly and there do not appear to be any outliers. There is a very slight curvature to the data, but I would say overall a linear approximation would work very well with this data.
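The guess about the correlation can be checked directly. This is a minimal sketch, assuming the built-in `women` data frame; the name `r` is only illustrative.

```r
# Load the built-in data set and compute the Pearson correlation
# between weight and height (r is an illustrative name)
data(women)
r <- cor(women$weight, women$height)
r
```

A value close to 1 would confirm a strong, positive linear relationship.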

Linear Approximation

WomenData <- lm(height ~ weight)
WomenData
## 
## Call:
## lm(formula = height ~ weight)
## 
## Coefficients:
## (Intercept)       weight  
##     25.7235       0.2872

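The intercept is the predicted height at a weight of 0 pounds, which the model puts at 25.7235 inches. Intuitively the line should pass through (0,0): if there is a woman who weighs 0 pounds, she should be 0 inches tall. But a weight of 0 is far outside the data (115 to 164 pounds), so the intercept is not a meaningful prediction here.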

plot(weight, height, ylab = "Height in inches", xlab = "Weight in Pounds", main = "Height and Weight of women")
abline(25.7235, .2872)
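Typing the rounded coefficients by hand works, but the fitted model can also be passed to `abline()` directly. This is a sketch, assuming the model is refit with `data = women` under the illustrative name `fit`.

```r
# Refit the model with an explicit data argument (fit is an illustrative name)
data(women)
fit <- lm(height ~ weight, data = women)

# Scatterplot with the fitted line drawn from the stored coefficients
plot(women$weight, women$height,
     xlab = "Weight in Pounds", ylab = "Height in inches",
     main = "Height and Weight of women")
abline(fit)

# Named vector holding the intercept and the slope at full precision
b <- coef(fit)
b
```

Using the model object avoids copying numbers by hand and keeps the full precision of the estimates.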

Predictions. Weight is 115. Predict the height

women[ , ]
##    height weight
## 1      58    115
## 2      59    117
## 3      60    120
## 4      61    123
## 5      62    126
## 6      63    129
## 7      64    132
## 8      65    135
## 9      66    139
## 10     67    142
## 11     68    146
## 12     69    150
## 13     70    154
## 14     71    159
## 15     72    164
#height = 25.7235 + .2872*weight

estHfor115 <- 25.7235 + .2872*115
estHfor115 
## [1] 58.7515
# Look up the observed height for the woman who weighs 115 pounds
realHfor115 <- height[weight == 115]
realHfor115
## [1] 58

Residual

residual115 <- realHfor115 - estHfor115
residual115
## [1] -0.7515

This means that our estimate was a little higher than the real value. This looks to be true from the graph.
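The same prediction can be made with `predict()` instead of typing the equation. This is a sketch, assuming the model is refit with `data = women`; `fit`, `pred115`, `actual115`, and `res115` are illustrative names.

```r
# Refit with an explicit data argument so predict() can match the column name
data(women)
fit <- lm(height ~ weight, data = women)

# Predicted height for a weight of 115 pounds, using unrounded coefficients
pred115 <- predict(fit, newdata = data.frame(weight = 115))

# Observed height for that weight, and the residual (observed minus predicted)
actual115 <- women$height[women$weight == 115]
res115 <- actual115 - pred115
res115
```

The result differs slightly from -0.7515 because the hand calculation uses coefficients rounded to four decimal places.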

women[1 , "weight"]
## [1] 115
women[2, "height"]
## [1] 59

Indexing

women[1, ]
##   height weight
## 1     58    115
women[2, ]
##   height weight
## 2     59    117
women[ , 1]
##  [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
women[ , 2]
##  [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164

Residual

height.actual <- women[5,1]
height.actual
## [1] 62
height.pred <- 25.7235 + .2872* women[5,2]
height.pred
## [1] 61.9107
resheight <- height.actual - height.pred
resheight
## [1] 0.0893

My estimate was a little low.

which(height == 58)
## [1] 1
which(height == 62)
## [1] 5
which(height == 70)
## [1] 13
which(height == 71)
## [1] 14
#now one that doesn't exist:
which(height == 100)
## integer(0)

It lets me know which row each of the values is found in.

Now, let’s look at all the residuals:

WomenData
## 
## Call:
## lm(formula = height ~ weight)
## 
## Coefficients:
## (Intercept)       weight  
##     25.7235       0.2872
Womenresids <- WomenData$residuals
Womenresids
##           1           2           3           4           5           6 
## -0.75711680 -0.33161526 -0.19336294 -0.05511062  0.08314170  0.22139402 
##           7           8           9          10          11          12 
##  0.35964634  0.49789866  0.34890175  0.48715407  0.33815716  0.18916026 
##          13          14          15 
##  0.04016335 -0.39608278 -0.83232892
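Each residual is the observed height minus the fitted height, the same calculation done by hand for single rows. This is a sketch that checks this for all rows at once, assuming the model is refit as `fit` (an illustrative name).

```r
data(women)
fit <- lm(height ~ weight, data = women)

# resid() returns the same values as fit$residuals
res <- resid(fit)

# Observed minus fitted, computed by hand for every row
by_hand <- women$height - fitted(fit)

# TRUE if the two agree up to numerical tolerance
all.equal(res, by_hand)

# For a least-squares line with an intercept, the residuals sum to
# (numerically) zero
sum(res)
```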

Now, let’s make a histogram of the residuals

hist(Womenresids)

Now a normal Q-Q plot of the residuals, with a reference line

qqnorm(Womenresids)
qqline(Womenresids)

I’d say the points are pretty good. They at least follow a linear pattern. Of course, they are not right on the line. The line shows where the points would fall if the residuals followed a normal distribution.

Let’s plot this again

plot(weight, height, ylab = "Height in inches", xlab = "Weight in Pounds", main = "Height and Weight of women")

And now plot the residuals against weight, with a horizontal line at zero

plot(WomenData$residuals ~ weight)
abline(0,0)

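That looks terrible. Ideally the residuals would scatter randomly around the zero line, but here they form an arch, which matches the slight curvature I saw in the scatterplot.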
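One way to check for a pattern without relying on the picture is to look at the signs of the residuals in order of weight. This is a sketch, assuming the model is refit as `fit`; `signs` is an illustrative name.

```r
data(women)
fit <- lm(height ~ weight, data = women)

# The rows of women are already sorted by weight, so the residual order
# matches the left-to-right order in the plot
signs <- unname(sign(resid(fit)))
signs

# Lengths of consecutive runs of the same sign
rle(signs)
```

Long runs of the same sign (negative, then positive, then negative) suggest the straight line is missing some curvature in the data.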

# Residual standard error: divide by the residual degrees of freedom (n - 2 = 13)
round(sqrt(sum((WomenData$residuals)^2)/13), 2)
## [1] 0.44
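The residual standard error divides the sum of squared residuals by the residual degrees of freedom, which is n - 2 for a line fit to n points. This is a sketch that reads the degrees of freedom from the model instead of typing it, assuming the model is refit as `fit`; `df` and `rse` are illustrative names.

```r
data(women)
fit <- lm(height ~ weight, data = women)

# Residual degrees of freedom: 15 observations minus 2 estimated coefficients
df <- df.residual(fit)

# Residual standard error computed by hand
rse <- sqrt(sum(resid(fit)^2) / df)
rse

# The value stored by summary(), for comparison
summary(fit)$sigma
```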

Summary of the Women Data

summary(WomenData)
## 
## Call:
## lm(formula = height ~ weight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83233 -0.26249  0.08314  0.34353  0.49790 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.723456   1.043746   24.64 2.68e-12 ***
## weight       0.287249   0.007588   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.44 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14
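The numbers printed in the summary can also be pulled out as values. This is a sketch, assuming the model is refit as `fit`; `s` is an illustrative name.

```r
data(women)
fit <- lm(height ~ weight, data = women)
s <- summary(fit)

# Multiple R-squared as a number
s$r.squared

# In a simple regression with a positive slope, its square root
# equals the correlation between the two variables
sqrt(s$r.squared)

# Matrix of estimates, standard errors, t values, and p-values
coef(s)
```

With an R-squared of about 0.991, the correlation comes out higher than the .8 or .9 guessed from the scatterplot.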