No class Thursday or next Tuesday (October 15). Please use one of these days as an open lab to work on PS#3.
There is usually a great way to do anything in R, but the solutions are sometimes specific and kind of hacky. This link might help: http://thomaslevine.com/!/r-spells-for-data-wizards/
Last time I showed you how to conduct statistical tests in R. Functions for producing formal statistical tests are available either from the base packages or from add-on packages.
One more test example:
Example: Professor X takes a random sample of students enrolled in World Geography. She finds the following: there are 25 freshmen in the sample, 32 sophomores, 18 juniors, and 20 seniors.
Test the null hypothesis that freshmen, sophomores, juniors, and seniors are equally represented among students signed up for the class.
This is a goodness-of-fit test with equal expected frequencies. That is, the null hypothesis is that the population frequencies are the same for each group (freshmen, sophomores, …).
obs = c(25, 32, 18, 20)
chisq.test(obs)
##
## Chi-squared test for given probabilities
##
## data: obs
## X-squared = 4.916, df = 3, p-value = 0.1781
You fail to reject the null hypothesis.
Suppose professor Y argues that X is wrong, and that the number of freshmen and the number of sophomores enrolled are each twice the number of juniors and the number of seniors. Test Y's hypothesis using these same data.
Now the expected frequencies are no longer equal, so a vector of null probabilities (p) must be specified; here it is 1/3, 1/3, 1/6, and 1/6.
null.probs = c(1/3, 1/3, 1/6, 1/6)
chisq.test(obs, p = null.probs)
##
## Chi-squared test for given probabilities
##
## data: obs
## X-squared = 2.8, df = 3, p-value = 0.4235
It doesn't appear that you can reject Y's null hypothesis either.
Note: this is why it is correct to say “I fail to reject the null hypothesis” and incorrect to say “I accept the null hypothesis.”
Linear regression generalizes a series of statistical tests. Recall the total lung capacity (tlc) dataset from ISwR.
Here we use a t-test to test the hypothesis that there is no difference in total lung capacity between men and women.
require(ISwR)
## Loading required package: ISwR
## Warning: there is no package called 'ISwR'
t.test(tlc ~ sex, data = tlc, var.equal = TRUE)
## Error: object 'tlc' not found
summary(lm(tlc ~ sex, data = tlc))
## Error: object 'tlc' not found
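Note: the ISwR package is not installed on the machine used to compile these notes, which is why you see the warning and errors above. If you run into the same problem, install the package once (assuming an internet connection) and rerun the two commands above.
install.packages("ISwR")  # one-time install
library(ISwR)             # then the t.test() and lm() calls above will run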
Suppose you have the following values for y (the response variable) and x (the explanatory variable).
y = c(2.9, -2.1, -0.5, 2.9, 4.2)
x = 1:5
To make a graph, use the following ggplot() function.
require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.0.2
p = ggplot(data.frame(x, y), aes(x, y))
Then add a point layer and a horizontal line at the mean of y.
p = p + geom_point(size = 4)
p = p + geom_hline(yintercept = mean(y))
p
The mean value represents a model for these data. No individual value fits the model precisely, but the model represents the best guess at what a new value might be.
It makes sense that the model provides a more accurate description of the data when the distances between the values and the mean are smaller.
You add these distances to the graph using vertical line segments.
p = p + geom_segment(aes(y = mean(y), yend = y, x = x, xend = x))
p
Some points are above the mean and others are below. Some points are farther from the mean and others are closer. If you sum the distances, what do you get? Try it.
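A one-line check confirms that the sum of the deviations from the mean is always zero, up to floating-point rounding.
sum(y - mean(y))  # essentially zero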
Instead you sum the squared differences.
sum((y - mean(y))^2)
## [1] 28.17
The sum of the squared differences is a measure of how well the model fits the data. The smaller the sum, the better the fit.
It is called the sum of the squared errors (SSE). The distances are called “errors.” Statisticians call them residuals.
An SSE equal to 0 would indicate that all data points fall exactly on the line.
We add this value to the plot using the text function.
p + geom_text(x = 4, y = -2, label = paste("SSE = ", round(sum((y - mean(y))^2),
1)))
Notice the pattern of the residuals. With the exception of the first value, the residuals go from large negative at the 2nd value to large positive at the 5th value.
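You can see this pattern by printing the residuals from the mean model.
round(y - mean(y), 2)
## [1]  1.42 -3.58 -1.98  1.42  2.72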
This pattern suggests that we might find a better line through the data, one that is not horizontal but slopes upward.
Let's draw another graph.
p1 = ggplot(data.frame(x, y), aes(x, y))
p1 = p1 + geom_point(size = 4)
p1
Next add a sloped line using
p1 = p1 + geom_smooth(method = lm, se = FALSE, col = "red")
p1
This is the linear regression line. We can again look at the distance from a particular value to this new line.
p1 = p1 + geom_segment(aes(y = predict(lm(y ~ x)), yend = y, x = x, xend = x),
col = "red")
p1
The sum of the squared differences for the regression model is
sum(residuals(lm(y ~ x))^2)
## [1] 22.39
p1 + geom_text(x = 4, y = -2, label = paste("SSE = ", round(sum(residuals(lm(y ~
x))^2), 1)))
You can see that the regression line provides a better model for these data in terms of the SSE. The SSE is smaller.
Note: When choosing a model from a set of competing models it is not always this simple.
For each value of x you have a corresponding value for y and a predicted value for y. You call the predicted value y hat.
p1 + geom_point(aes(x, predict(lm(y ~ x))), pch = 15, size = 4)
You refer to \( \hat y_i \) as the predicted value at \( x_i \) and to the estimated regression line as the prediction line.
The difference between the actual value \( y_i \) and this predicted value \( \hat y_i \) is the residual (error), \( e_i \). The residual is the vertical distance from the squares to the circles.
How is the regression line chosen?
The method of least squares determines the line such that the sum of the squared residuals is as small as possible. It minimizes the sum of squared error (SSE).
Mathematically: \[ \hat \beta_1 = \frac{\sum (x_i-\bar x)(y_i-\bar y)}{\sum (x_i-\bar x)^2}\\ \hat \beta_0 = \bar y - \hat \beta_1 \bar x \]
In R code:
beta1 = sum((x - mean(x)) * (y - mean(y)))/sum((x - mean(x))^2)
beta1
## [1] 0.76
beta0 = mean(y) - beta1 * mean(x)
beta0
## [1] -0.8
beta1 (\( \beta_1 \)) and beta0 (\( \beta_0 \)) are called regression model parameters or regression model coefficients. \( \beta_1 \) is the slope parameter and \( \beta_0 \) is the y-intercept parameter.
Note that the regression line goes through the point defined by the mean of x and the mean of y and has a slope equal to \( \beta_1 \).
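A quick numerical check using the values computed above:
beta0 + beta1 * mean(x)
## [1] 1.48
mean(y)
## [1] 1.48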
We could use the above formulas to determine the values of the regression parameters (coefficients), but R provides the lm() function to do this for us.
The notation is lm(y ~ x). The ~ (tilde) in this notation is read “is modeled by”. So the model formula y ~ x is read “y is modeled by x”.
If we use lm(y ~ x) then we say “y is modeled by x” in a linear way.
lm(y ~ x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## -0.80 0.76
The value of \( \beta_0 \) is listed below the word (Intercept) and the value of \( \beta_1 \) is listed below the explanatory variable name (in this case, simply x).
Question: We saw that cor(y,x) = cor(x,y). Is y modeled by x the same as x modeled by y? That is, do we get the same parameter values if we type: lm(x ~ y)? Try it.
With this output we write that the mean of y = -0.8 + 0.76 * x
Data in fat.txt are taken from Johnson, R.W. (1996). “Fitting percentage of body fat to simple body measurements.” Journal of Statistics Education, 4(1).
Various body measurements are taken from 47 subjects. An accurate estimate of the percentage body fat is also recorded for each. Body fat can be modeled with a linear regression.
The various measurements other than body fat recorded on the subjects are ones that are easy to obtain and provide a way to estimate body fat, which is not easily obtained. Here we use all 47 subjects but consider only abdomen circumference as the explanatory variable.
Abdomen circumference is measured in units of centimeters (cm) and is measured at the umbilicus and level with the iliac crest (top of the hip bone). Body fat, a measure of health, is estimated through an expensive underwater weighing technique. Percentage body fat (bodyfat) is obtained using Siri's equation, 495/density - 450.
loc = "http://myweb.fsu.edu/jelsner/data/fat.txt"
fat = read.table(loc, header = TRUE)
head(fat)
## bodyfat abdomen biceps forearm wrist
## 1 12.6 85.2 32.0 27.4 17.1
## 2 6.9 83.0 30.5 28.9 18.2
## 3 24.6 87.9 28.8 25.2 16.6
## 4 10.9 86.4 32.4 29.4 18.2
## 5 27.8 100.0 32.2 27.7 17.7
## 6 20.6 94.4 35.7 30.6 18.8
The response variable is bodyfat (percentage). That is the variable you are interested in explaining using the other variable. The response variable is always plotted on the vertical axis (y-axis).
The explanatory variable is abdomen (circumference). It is the variable you use to explain the response variable. It is plotted on the horizontal axis (x-axis).
p2 = ggplot(fat, aes(x = abdomen, y = bodyfat)) + geom_point(size = 3) + xlab("Abdomen Circumference (cm)") +
ylab("Body Fat (%)")
p2
You can see that a linear fit of percentage body fat against abdomen circumference makes sense. As abdomen circumference increases so does the percent body fat.
There are an infinite number of lines that can be placed through the set of points. Here you consider two.
p2 = p2 + geom_abline(intercept = 20, slope = 0.1)
p2 = p2 + geom_segment(aes(y = 20 + 0.1 * abdomen, yend = bodyfat, x = abdomen,
xend = abdomen))
p2 = p2 + geom_abline(intercept = -28.56, slope = 0.505, col = "red")
p2 = p2 + geom_segment(aes(y = -28.56 + 0.505 * abdomen, yend = bodyfat, x = abdomen,
xend = abdomen), col = "red")
p2
The black line shown is not a good fit. It is too far above the cluster of points for small values of abdomen. The red line is better. It is the line that minimizes the sum of the squared distances (residuals) between the line and the points.
The sum is minimized when the equation of the line is y = -28.56 + 0.505 * x. Algebra and calculus can be used to verify that the least-squares line minimizes the sum of squared errors.
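You can verify this numerically by computing the SSE for the two lines drawn above (a quick sketch using the intercepts and slopes from the plot).
sse.black = sum((fat$bodyfat - (20 + 0.1 * fat$abdomen))^2)
sse.red = sum((fat$bodyfat - (-28.56 + 0.505 * fat$abdomen))^2)
c(black = sse.black, red = sse.red)  # the red (least-squares) line has the smaller SSE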
It is apparent that body fat is positively related to abdominal circumference. The least-squares line is used to predict body fat from abdomen circumference.
The variance of the residuals (\( s^2 \)) is SSE/(n-2), where n is the sample size. The square root of this variance is called the standard error of estimate, the standard error, or the root mean square error. It indicates the typical vertical deviation of a point from the regression line.
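As a check, compute it by hand for the small x and y example from earlier and compare it with the value R reports (summary() of an lm object stores it as sigma).
mod = lm(y ~ x)
sqrt(sum(residuals(mod)^2)/(length(y) - 2))  # standard error of estimate by hand
summary(mod)$sigma  # the same value reported by R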
To obtain the regression coefficients (parameters), type
lm(bodyfat ~ abdomen, data = fat)
##
## Call:
## lm(formula = bodyfat ~ abdomen, data = fat)
##
## Coefficients:
## (Intercept) abdomen
## -28.560 0.505
The y-intercept (\( \hat \beta_0 \)) is -28.56 and the slope (\( \hat \beta_1 \)) is +0.505.
The units on the y-intercept are [%] and the units of slope are [%/cm].
You write the equation for the least-squares line as
mean (average) body fat in % = -28.56[%] + 0.505[%/cm] * abdomen circumference[cm]
You interpret the slope value of +0.505 as follows: for every 1 cm increase in abdomen circumference, average body fat increases by about 0.5 percentage points.
Regression is a model for the mean, not for a particular value.
To create a scatter plot of your data that includes the least-squares regression line, type
p3 = ggplot(fat, aes(x = abdomen, y = bodyfat)) + geom_point(size = 3) + xlab("Abdomen Circumference (cm)") +
ylab("Body Fat (%)")
p3 + geom_smooth(method = lm, se = FALSE)
The maximum heart rate in beats per minute of a person is inversely related to the person's age.
Suppose 15 randomly chosen people of varying ages are tested for maximum heart rate and the following data are found. First create a data frame.
Age = c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
HR = c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183,
178)
heart = data.frame(HR = HR, Age = Age)
The age is in years and the heart rate is in beats per minute.
The first question we need to ask is which of the two variables is the response. This question is answered without regard to R or statistics. It must be answered before you model the data. The answer will influence your results.
With the heart rate data you want to predict heart rate given age. In this case HR is the response variable and Age is the explanatory variable.
This is the difference between computing summary statistics and using statistics for drawing inferences. You must first stop and think (e.g., null hypothesis, response, etc).
Next you want to create a scatter plot of the data and add the regression line as a layer. Do this now.
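One way to do this, following the same pattern used above for the fat data:
p4 = ggplot(heart, aes(x = Age, y = HR)) + geom_point(size = 3) + xlab("Age (years)") +
    ylab("Maximum Heart Rate (bpm)")
p4 + geom_smooth(method = lm, se = FALSE)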
Next you want to determine the equation for the line. You do this using the lm() function by typing
lm(HR ~ Age, data = heart)
##
## Call:
## lm(formula = HR ~ Age, data = heart)
##
## Coefficients:
## (Intercept) Age
## 210.048 -0.798
You interpret the model as follows: On average a person's maximum HR decreases by 8 bpm every 10 years.
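This figure comes directly from the slope: a 10-year increase in Age multiplied by the slope of about -0.8 bpm per year.
coef(lm(HR ~ Age, data = heart))["Age"] * 10  # about -8 bpm per decade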
Suppose we want to use the regression model to predict the maximum heart rate of a 50-year-old.
You can type
210.0485 - 0.7977 * 50
## [1] 170.2
The model predicts that a 50-year-old can expect to have a maximum heart rate of about 170 bpm. Said another way, given a random set of 50-year-olds, the model predicts a mean maximum heart rate of 170 bpm.
The calculation can be done without having to type the coefficients. First save the model as an lm object, then use the predict() function.
model = lm(HR ~ Age, data = heart)
predict(model, data.frame(Age = 50))
## 1
## 170.2
This is better since we can input a vector of x values for which we want predictions.
predict(model, data.frame(Age = seq(40, 60, 10)))
## 1 2 3
## 178.1 170.2 162.2
The data set homedata (UsingR) contains assessed values of houses in Maplewood, NH for the years 1970 and 2000.
Suppose you are interested in predicting house values in 2000 from house values in 1970.
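To get started (a sketch, assuming the UsingR package is installed; if not, install it first with install.packages("UsingR")):
require(UsingR)
head(homedata)  # inspect the variable names before fitting a model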