What I learned from the RegressionIntro:
I learned that it is really helpful to graph data in order to analyze it. In my past stats classes I would mostly just look at the numbers to try to determine a pattern or a trend. Sometimes we would graph the points and look at the actual data. What was really helpful here was graphing the residuals. This gives a nice idea of how well the approximation works because it tells us how close the predicted values were to the real values. I really enjoy using RStudio and I think it works really well for statistical analysis.
Load and take a look at the data
data(women)
#head(women)
#names(women)
attach(women)
Scatterplot
plot(weight, height, ylab = "Height in inches", xlab = "Weight in Pounds", main = "Height and Weight of women")
Trends: There is definitely a trend. As the weight of the women increases, their height increases. There is a strong, positive correlation; I would guess it is close to .8 or .9. The data are packed tightly and there do not appear to be any outliers. There is a very slight curvature to the data, but overall a linear approximation should work very well with this data.
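The guess about the correlation can be checked directly. This is a minimal sketch, assuming the built-in women data set; the variable name r is only for illustration.

```r
# Pearson correlation between weight and height in the built-in women data
data(women)
r <- cor(women$weight, women$height)
r
```

Because the points lie almost exactly on a line, the value should come out even closer to 1 than the guess above.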
Linear Approximation
WomenData <- lm(height ~ weight)
WomenData
##
## Call:
## lm(formula = height ~ weight)
##
## Coefficients:
## (Intercept) weight
## 25.7235 0.2872
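The fitted intercept is 25.72, not 0. Intuitively the line should pass through (0, 0): a woman who weighs 0 pounds should be 0 inches tall. But a weight of 0 is far outside the observed range of 115 to 164 pounds, so the intercept is an extrapolation and should not be read literally.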
plot(weight, height, ylab = "Height in inches", xlab = "Weight in Pounds", main = "Height and Weight of women")
abline(25.7235, .2872)
Predictions: predict the height for a weight of 115 pounds
women[ , ]
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
## 7 64 132
## 8 65 135
## 9 66 139
## 10 67 142
## 11 68 146
## 12 69 150
## 13 70 154
## 14 71 159
## 15 72 164
#height = 25.7235 + .2872*weight
estHfor115 <- 25.7235 + .2872*115
estHfor115
## [1] 58.7515
# Look up the observed height for the woman who weighs 115 pounds (row 1)
realHfor115 <- women[weight == 115, "height"]
realHfor115
## [1] 58
Residual
residule115 <- realHfor115 - estHfor115
residule115
## [1] -0.7515
This means that our estimate was a little higher than the real value. This looks to be true from the graph.
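The same prediction and residual can be obtained from the fitted model itself instead of retyping rounded coefficients. This is a sketch, assuming the same model as above refit under the hypothetical name fit; pred115 and obs115 are also illustrative names.

```r
data(women)
fit <- lm(height ~ weight, data = women)  # same model as WomenData

# Predicted height for a weight of 115 pounds, using unrounded coefficients
pred115 <- predict(fit, newdata = data.frame(weight = 115))

# Observed height for that weight, then residual = observed - predicted
obs115 <- women$height[women$weight == 115]
obs115 - pred115
```

The result differs slightly from -0.7515 because the hand calculation used coefficients rounded to four decimal places; it should agree with the first entry of the model's residual vector, -0.7571.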
women[1 , "weight"]
## [1] 115
women[2, "height"]
## [1] 59
Indexing
women[1, ]
## height weight
## 1 58 115
women[2, ]
## height weight
## 2 59 117
women[ , 1]
## [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
women[ , 2]
## [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
Residual for row 5
height.actual <- women[5,1]
height.actual
## [1] 62
height.pred <- 25.7235 + .2872* women[5,2]
height.pred
## [1] 61.9107
resheight <- height.actual - height.pred
resheight
## [1] 0.0893
My estimate was a little low.
which(height == 58)
## [1] 1
which(height == 62)
## [1] 5
which(height == 70)
## [1] 13
which(height == 71)
## [1] 14
#now one that doesn't exist:
which(height == 100)
## integer(0)
which() lets me know which row each of the values is found in. When no row matches, it returns integer(0).
Now, let’s look at all the residuals.
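which() also works with any logical condition, not only equality, and its result can be used to index the data frame. A small sketch; tall_rows is an illustrative name.

```r
data(women)

# Row numbers where height exceeds 70 inches
tall_rows <- which(women$height > 70)
tall_rows

# Use those row numbers to pull the matching rows
women[tall_rows, ]
```

With this data set the condition picks out rows 14 and 15, the women who are 71 and 72 inches tall.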
WomenData
##
## Call:
## lm(formula = height ~ weight)
##
## Coefficients:
## (Intercept) weight
## 25.7235 0.2872
Womenresids <- WomenData$residuals
Womenresids
## 1 2 3 4 5 6
## -0.75711680 -0.33161526 -0.19336294 -0.05511062 0.08314170 0.22139402
## 7 8 9 10 11 12
## 0.35964634 0.49789866 0.34890175 0.48715407 0.33815716 0.18916026
## 13 14 15
## 0.04016335 -0.39608278 -0.83232892
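Each of these residuals is just the observed height minus the fitted height for that row. The sketch below checks that, assuming the model is refit under the hypothetical name fit; manual_resids is also an illustrative name.

```r
data(women)
fit <- lm(height ~ weight, data = women)  # same model as WomenData

# Residual = observed height - fitted height, computed for every row at once
manual_resids <- women$height - fitted(fit)

# Compare with the residuals stored in the model object
all.equal(unname(manual_resids), unname(resid(fit)))
```

resid(fit) and fitted(fit) are the accessor functions for the same values that fit$residuals and fit$fitted.values hold.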
Now, let’s make a histogram of the residuals
hist(Womenresids)
Now a normal Q-Q plot of the residuals with a reference line
qqnorm(Womenresids)
qqline(Womenresids)
I’d say the points are pretty good. They at least follow a roughly linear pattern, although they are not right on the line. The line shows where the points would fall if the residuals were exactly normally distributed.
Let’s plot this again
plot(weight, height, ylab = "Height in inches", xlab = "Weight in Pounds", main = "Height and Weight of women")
And now the residual plot, with a horizontal reference line at zero
plot(WomenData$residuals ~ weight)
abline(0,0)
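That looks terrible. A good residual plot should show points scattered randomly around zero with no pattern. Here the residuals are negative at both ends and positive in the middle, a curved pattern that suggests the relationship is not perfectly linear, which matches the slight curvature in the scatterplot.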
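For comparison, a residual plot from data that really is a straight line plus random noise typically shows no pattern. This is a minimal sketch with simulated data; the intercept, slope, noise level, and the names x, y, and sim_fit are made up for illustration and only loosely modeled on the fit above.

```r
set.seed(1)  # make the simulated noise reproducible

# Simulated data: an exactly linear relationship plus normal noise
x <- seq(115, 164, length.out = 15)
y <- 25.7 + 0.29 * x + rnorm(15, sd = 0.4)

sim_fit <- lm(y ~ x)

# Residuals against the predictor, with a horizontal line at zero
plot(resid(sim_fit) ~ x, ylab = "Residual",
     main = "Residuals from simulated linear data")
abline(h = 0)
```

In a plot like this the points should bounce above and below the zero line without forming an arch or a funnel.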
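# Residual standard error: divide by the residual degrees of freedom (n - 2 = 13 here)
round(sqrt(sum((WomenData$residuals)^2)/df.residual(WomenData)), 2)
## [1] 0.44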
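The residual standard error that summary() reports divides the sum of squared residuals by the residual degrees of freedom, which is n - 2 for a line with an intercept and one slope. The sketch below cross-checks a hand calculation against the stored value; fit, rse_manual, and rse_summary are illustrative names.

```r
data(women)
fit <- lm(height ~ weight, data = women)  # same model as WomenData

n <- nrow(women)  # 15 observations

# Hand calculation with n - 2 degrees of freedom
rse_manual <- sqrt(sum(resid(fit)^2) / (n - 2))

# Value stored by summary()
rse_summary <- summary(fit)$sigma

all.equal(rse_manual, rse_summary)
```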
Summary of the Women Data
summary(WomenData)
##
## Call:
## lm(formula = height ~ weight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83233 -0.26249 0.08314 0.34353 0.49790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.723456 1.043746 24.64 2.68e-12 ***
## weight 0.287249 0.007588 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.44 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
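Individual numbers from this summary can be pulled out rather than read off the printout. This is a sketch, again with fit as a hypothetical name for the same model; it also checks that, for a regression with one predictor, R-squared equals the squared correlation.

```r
data(women)
fit <- lm(height ~ weight, data = women)  # same model as WomenData
fit_summary <- summary(fit)

# Multiple R-squared as a plain number
r2 <- fit_summary$r.squared
r2

# With a single predictor, R-squared is the squared Pearson correlation
all.equal(r2, cor(women$weight, women$height)^2)

# Coefficient table (estimates, standard errors, t values, p-values) as a matrix
coef(fit_summary)
```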