We are going to look at how to construct a linear model for a set of data. I will include the syntax for R as well as narrating and explaining the statistical meaning of the tests and results. First we will need data to analyze and work with. We will take the data set MacdonellDF from the package HistData. The following steps are how to import the set. Tools> Install Packages > HistData
library(HistData)
data(MacdonellDF)
attach(MacdonellDF)
fing <- MacdonellDF
This call our data and will attach the data set we are going to use and now we should see what the data looks like. I also gave it a name that is easier to use. The “<-” means that the thing to the right will be called the thing to the left. MacdonellDF is now called fing and is easier to use.
names(fing)
## [1] "height" "finger"
This returns the names of the different categories in the set
head(fing)
## height finger
## 1 4.630208 10.0
## 2 4.713542 10.3
## 3 4.796875 9.9
## 4 4.796875 10.2
## 5 4.796875 10.2
## 6 4.796875 10.3
This shows us a few data pairs from the set to give us a general idea
summary(fing)
## height finger
## Min. :4.630 Min. : 9.50
## 1st Qu.:5.297 1st Qu.:11.20
## Median :5.380 Median :11.50
## Mean :5.420 Mean :11.55
## 3rd Qu.:5.547 3rd Qu.:11.90
## Max. :6.380 Max. :13.50
This command gives us the mean and five number summary from the 2 categories of data.
Now lets visualize the data so that we can see if there might be a linear relationship between the two variabes. We will create a scatterplot and use the height as the dependent variable and finger length as the predictor. So that the graph is easier to read we will also give it labels. Height is our response and finger length is our predictor. The plot command takes two quanitative variables and plots them against each other. The first variable is the response and the second is the predictor. They are seperated by a “~”. You can also add labels to the axis with “xlab” and “ylab”. We should add jitter to the predictors so that the data is easier to see and not bunched up.
We will use the cor command to determine how closely correlated the two varables height and finger length. A number with an absolute value closer to one means that the variables are closely related.
cor(height, finger)
## [1] 0.6557069
We have an r value of .6557069 which means they are positively correlated. The variables are related but not very strongly. If we square our r value we get a coefficient of determination. It means that r^2 amount of variablilty in the response is explained by the predictor variable.
jfing<- jitter(finger)
jheight<- jitter(height)
The jitter function adds small amount of jitter to each of the points so that they are not overlapping. The darker area shows where the data is more concentrated.
plot(jheight~jfing, ylab = "Height(Feet)", xlab = "Finger Length (mm)", main = "Macdonell Scatter Plot" , col = 6 )
We can change the color of the plot points by indicating a number for color, this is done by col = “integer”
The plot shows there could be a positive linear relationship but we would need to run additional tests to find out what the relationship is and how to quantify it. To do this we will use linear regression. We will name the model “fingmod”.
The lm function creates a linear model with the respone variable first and the predictor second. The summary of the model gives us lots of information about the signfigance and point estimates. The model gives us values for out predicted betas. \[\hat{Height}= \hat{\beta_0}+\hat{\beta_1}Finger Length\]
fingmod <- lm(height~finger)
summary(fingmod)
##
## Call:
## lm(formula = height ~ finger)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54590 -0.10376 -0.00188 0.10692 1.04906
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.478423 0.061922 40.02 <2e-16 ***
## finger 0.254708 0.005356 47.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.161 on 2998 degrees of freedom
## Multiple R-squared: 0.43, Adjusted R-squared: 0.4298
## F-statistic: 2261 on 1 and 2998 DF, p-value: < 2.2e-16
The summary contains the point estimate for the intercept and the coeffcient in the estimate column. It also gives the standard error and test statistics.
\[\hat{Height}= 2.478423+ 0.254708*Finger Length\]
This shows us that for every mm increase in finger length we can expect the height to increase by .2547 feet. The intercept should not be interpreted because we don’t have fingers that can be 0 length naturally. The small p values means that when we test against the hypothesis that the coeffcient of finger is 0, we would reject the null; In other words, finger lenght is signfigant in predicting height. We should plot this on the scatter plot to confirm this relationship. We can plot the linear model over our data to see the fit.
Our summary gives us the T-test statistic. We are testing against the null hypothesis that \(H_0: \hat{\beta_1}=0\). If \({\beta_1}=0\) then a change in finger length will have no change on the height of a person, meaning that there is no linear relationship betweem the variables. The alternative hypothesis is, \(H_a:\hat{\beta_1}\neq 0\). R returns both the test statistic and p-value for the test. Using an \({\alpha}=.05\), and the P-value from the summary of 2e-16 we will reject the null and conclude that the variable is signfigant.
We will use the plot, abline, and coefficient commands. The abline creates a line with the first input as the intercept and the second as the slope. We use the coeffcient command instead of inputting them manually. The command returns the coeffcients of the model we input.
plot(jheight~jfing, ylab = "Height(Feet)", xlab = "mm", main = "Macdonell Scatter Plot" , col = 8 )
abline(coefficients(fingmod), col = 4)
This shows us that there is a linear relationship between the two variables, Or in other words, The height of a person can be predicted with the length of their left hand middle finger. We can see it both in the p values and in the graph. We should check that our model is a good fit. We will do this by looking at the residuals.
Lets predict a height given a finger lenghth. First we need to get a finger length to predict on. The first argument in the brackets is the row that the data will be taken from and the second argument is what column the number will be pulled from.
fing[1248, 2]
## [1] 11.4
This returns the finger value from the 1248th data entry. We can then use our linear regression equation to predict the height. We will use the coeffcient command to quickly call these coeffcients.
coefficients(fingmod)
## (Intercept) finger
## 2.4784231 0.2547076
We can plug in our value for finger of 11.4 into the equation.
2.4784231 + 11.4*.2547076
## [1] 5.38209
This returns a predicted value of 5.382 feet for height. Now lets check this against the actual, recall that we used the finger length from the 1248th row.
fing[1248,1]
## [1] 5.380208
We get a value from this but we can do it all in one step and speed the process up.
fing[1248,1]-5.38209
## [1] -0.001881667
This shows us that our prediction was .00188 feet too tall but it was still a good predictor.
We can also predict what a new data point would be. That is, given a new finger length we can predict the height. We do need to be sure to not extrapolate and only predict values inside the range of our data.
newdata <- data.frame(finger = 11.85)
(predh <- predict(fingmod, newdata, interval="predict") )
## fit lwr upr
## 1 5.496708 5.181045 5.812371
This gives us a point estimate of 5.496 feet and an interval of heights.
We want to construct confidence intervals for the estimates that were generated. We will use the command “confint”. We can input our model and it will generate the intervals for the coeffcients. We can also input what our confidence level should be. We will use 95%, Since 95% is the default we don’t need to input anything but we still will for clairty.
confint(fingmod, level = .95)
## 2.5 % 97.5 %
## (Intercept) 2.357009 2.5998369
## finger 0.244205 0.2652102
This means that we are 95% confident that the true estimator for the finger length coeffcient is between .244 and .265. Also if we take a large number of random samples in the same way then 95% of those intervals will contain the true intercept.
We will use our new data to construct a prediction interval. We will use the predict command to construct these intervals. The predict function takes the input (model, new data, type of interval). The default level for this test is 95%. We have already defined a new data frame and we will use this to predict on.
newdata
## finger
## 1 11.85
The height we will predict is for a person with finger length 11.85 mm.
prediction<- predict(fingmod, newdata, interval = "predict" )
confident<- predict(fingmod, newdata, interval = "confidence" )
prediction
## fit lwr upr
## 1 5.496708 5.181045 5.812371
confident
## fit lwr upr
## 1 5.496708 5.490128 5.503288
The prediction interval tells us that we are 95% confident that if a person’s finger is 11.85 mm they will be between 5.181 and 5.812 feet. The confidence interval tells us that we are 95% confident that the mean height for a person with finger length 11.85 mm wil between 5.49 and 5.5 feet tall. The confidence interval is much tighter because we are predicting for a mean instead of a single value. We can see that the point estimates are the same because they use the same model.
It is useful to look at the error of our model. There are a couple ways to do this.
summary(fingmod)
##
## Call:
## lm(formula = height ~ finger)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54590 -0.10376 -0.00188 0.10692 1.04906
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.478423 0.061922 40.02 <2e-16 ***
## finger 0.254708 0.005356 47.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.161 on 2998 degrees of freedom
## Multiple R-squared: 0.43, Adjusted R-squared: 0.4298
## F-statistic: 2261 on 1 and 2998 DF, p-value: < 2.2e-16
The mean square error is .161. This is how the errors are averaged in our model. It is the average of the residuals squared.
Residual is the difference between the actual and predicted value. We can look at the residual for the point that we predicted earlier. When we take our observed value of 5.38209 and subtract the predicted value of 5.382. This gives us a residual of .00188 feet. We will have a residual value for all of the points and the can be useful in evaluating the data.
Residuals<- fingmod$residuals
This command names the residuals from our model. It also gives us a residual value for each predicted point and stores them all in one place.
We need to make sure that all the assumptions hold for the model. The errors need to be normally distributed. We can look at a histogram of the residuals.
hist(Residuals, col = 6)
The hist function gives us a histogram, I input the residuals from the fingmod linear model to visualize the distribution of the residuals. The data looks a little skewwed to the right but with our high amount of data points I think it is safe to continue.
We want our residuals to have a constant variance for the predicted values. If this is true our data is said to have homoscedasticity. We need to jitter the residuals so we can get a sense of the distribution of the residuals.
JResiduals<- jitter(Residuals)
plot(JResiduals~jfing)
abline(0,0)
The data seems to have a variance which is independent of the finger length. The residuals also seem to have a good distribution and don’t depend on the finger length
We can look at the qq plot to check for normality of the residuals. We want the points to be close to the line.
qqnorm(Residuals)
qqline(Residuals, col = 6)
The data all looks close to the line except for a few possible outliers on the upper end. I think that given the high number of data points the qq plot looks good. We should check to make sure our variance is the same for all valuse of finger length. We can do this by plotting the residuals to their finger value.