We are going to look at how to construct a linear model for a set of data. I will include the R syntax and explain the statistical meaning of the tests and results. First we need data to work with. We will use the data set MacdonellDF from the package HistData. To install the package in RStudio, go to Tools > Install Packages > HistData (or run install.packages("HistData") in the console). Then load the package and the data:
library(HistData)
attach(MacdonellDF)
fing <- MacdonellDF
This attaches the data set, so that its columns can be referred to by name, and stores a copy under the shorter name fing. Now we should see what the data look like.
names(fing)
## [1] "height" "finger"
This returns the names of the variables (columns) in the data set.
head(fing)
## height finger
## 1 4.630208 10.0
## 2 4.713542 10.3
## 3 4.796875 9.9
## 4 4.796875 10.2
## 5 4.796875 10.2
## 6 4.796875 10.3
This shows the first six rows of the data set to give us a general idea of its contents.
summary(fing)
## height finger
## Min. :4.630 Min. : 9.50
## 1st Qu.:5.297 1st Qu.:11.20
## Median :5.380 Median :11.50
## Mean :5.420 Mean :11.55
## 3rd Qu.:5.547 3rd Qu.:11.90
## Max. :6.380 Max. :13.50
This command gives us the mean and the five-number summary for each of the two variables.
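As a rough sketch, the same summaries can be pulled out one column at a time without relying on attach(). The `$` and with() forms below are an alternative, not what the rest of this walkthrough uses; attach() is convenient, but an object with the same name in the workspace would hide an attached column. The name `r_fh` is just an illustrative choice.

```r
library(HistData)

fing <- MacdonellDF

# Column-wise summaries, referring to the data frame explicitly
mean(fing$finger)
quantile(fing$finger)   # min, quartiles, and max

# with() evaluates an expression using the columns of fing
r_fh <- with(fing, cor(finger, height))
r_fh                    # sample correlation between finger and height
```

The correlation gives a first numeric hint about how strongly the two variables move together before any model is fitted.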
Now let’s visualize the data so that we can see whether there might be a linear relationship between the two variables. We’ll use height as the dependent variable. To make the graph easier to read, we will also label the axes.
plot(finger, height, ylab = "Height(Feet)", xlab = "Finger(mm)")
The plot shows there could be a linear relationship, but we need to run additional tests to find out what the relationship is and how to quantify it.
## Linear Regression
To do this we will use linear regression. We will name the model “fingmod”.
fingmod <- lm(height~finger)
summary(fingmod)
##
## Call:
## lm(formula = height ~ finger)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54590 -0.10376 -0.00188 0.10692 1.04906
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.478423 0.061922 40.02 <2e-16 ***
## finger 0.254708 0.005356 47.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.161 on 2998 degrees of freedom
## Multiple R-squared: 0.43, Adjusted R-squared: 0.4298
## F-statistic: 2261 on 1 and 2998 DF, p-value: < 2.2e-16
This shows us that for every 1 mm increase in finger length we can expect height to increase by 0.2547 feet. The intercept should not be interpreted, because a finger length of 0 does not occur naturally and lies far outside the range of the data. The small p-values mean that when we test the null hypothesis that the coefficient of finger is 0, we reject it; in other words, finger length is significant in predicting height. The multiple R-squared of 0.43 means that finger length explains about 43% of the variation in height. We should add the fitted line to the scatter plot to confirm this relationship.
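The numbers in the coefficient table are tied together by simple formulas, and it can help to reproduce them by hand. The sketch below uses only values copied from the printed summary (so it is accurate only up to rounding); the variable names are illustrative. The R-squared shortcut shown applies to a model with a single predictor.

```r
# Values copied from the summary(fingmod) printout
est    <- 0.254708   # slope estimate (feet per mm)
se     <- 0.005356   # standard error of the slope
df     <- 2998       # residual degrees of freedom
f_stat <- 2261       # F-statistic

# t statistic: estimate divided by its standard error
t_val <- est / se

# Two-sided p-value for H0: slope = 0
p_val <- 2 * pt(-abs(t_val), df)

# 95% confidence interval for the slope
ci <- est + c(-1, 1) * qt(0.975, df) * se

# With one predictor, R-squared can be recovered from F and df
r2 <- f_stat / (f_stat + df)

t_val; p_val; ci; r2
```

If the fitted model is available, confint(fingmod) should give the same kind of interval for the slope without the rounding.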
plot(finger, height, ylab = "Height(Feet)", xlab = "Finger(mm)")
abline(coefficients(fingmod))
This shows us that there is a linear relationship between the two variables. In other words, the height of a person can be predicted from the length of their left middle finger. We can see it both in the p-values and in the graph. We should also check that our model is a good fit. We will do this by looking at the residuals.
fingres <- fingmod$residuals
This gives us a residual for every data point, so now we can look at them as a whole. We can use a Q-Q plot to check whether the residuals are approximately normal.
qqnorm(fingres)
qqline(fingres)
The points all lie close to the line except for a few possible outliers at the upper end. Given the large number of data points, I think the Q-Q plot looks good. We should also check that the variance of the residuals is the same for all values of finger length. We can do this by plotting the residuals against finger length.
plot(fingres~finger)
abline(h = 0)
The variance of the residuals appears to be independent of finger length.
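As an alternative sketch, base R can draw similar diagnostic plots directly from the fitted model through plot() on an lm object. The model is refitted here with the `data =` argument so the block runs on its own; that is a stylistic choice, and the fit should be the same as the attach()-based one above.

```r
library(HistData)

# Same model as above, fitted without attach()
fingmod <- lm(height ~ finger, data = MacdonellDF)

# Show the diagnostics side by side, then restore the plot settings
op <- par(mfrow = c(1, 3))
plot(fingmod, which = 1)  # residuals vs fitted values
plot(fingmod, which = 2)  # normal Q-Q plot of standardized residuals
plot(fingmod, which = 3)  # scale-location plot (spread of residuals)
par(op)
```

The first and third panels address the constant-variance question, and the second is the counterpart of the qqnorm() plot.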
Let’s predict a height from a given finger length. First we need a finger length to predict on. Inside the square brackets, the first argument is the row to take the value from and the second argument is the column.
fing[1248, 2]
## [1] 11.4
This returns the finger value from the 1248th data entry. We can then use our linear regression equation to predict the height. We will use the coefficients function to quickly retrieve the coefficients.
coefficients(fingmod)
## (Intercept) finger
## 2.4784231 0.2547076
We can plug our finger value of 11.4 into the equation.
2.4784231 + 11.4*0.2547076
## [1] 5.38209
This returns a predicted height of 5.382 feet. Now let’s check this against the actual value; recall that we used the finger length from the 1248th row.
fing[1248,1]
## [1] 5.380208
This returns the actual height. We can also compute the difference between the actual and predicted heights in one step.
fing[1248,1]-5.38209
## [1] -0.001881667
This shows us that our prediction was 0.00188 feet too tall, so it was still a good prediction. This difference is the residual for that row.
We can also predict what a new data point would be. That is, given a new finger length, we can predict the height.
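The hand calculation can also be done with predict(), which avoids retyping the coefficients. This is a sketch rather than the only way to do it; the names `pred_1248` and `res_1248` are illustrative, and the model is refitted with `data =` so the block is self-contained.

```r
library(HistData)

fing <- MacdonellDF
fingmod <- lm(height ~ finger, data = fing)

# Predicted height for row 1248, using the fitted model directly
pred_1248 <- predict(fingmod, newdata = fing[1248, ])

# Actual minus predicted: the residual for that row
res_1248 <- fing$height[1248] - pred_1248

pred_1248
res_1248

# The model already stores this value among its residuals
all.equal(unname(res_1248), unname(residuals(fingmod)[1248]))
```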
newdata <- data.frame(finger = 11.85)
(predh <- predict(fingmod, newdata, interval = "prediction"))
## fit lwr upr
## 1 5.496708 5.181045 5.812371
This gives us a point estimate of 5.497 feet and a 95% prediction interval of 5.181 to 5.812 feet for the height of a new person with that finger length.
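predict() accepts several new values at once, and it can return either a prediction interval (for one new person's height) or a confidence interval (for the mean height at that finger length). The sketch below compares the two; the finger lengths 10.5 and 13 are made-up values inside the observed range, and the names `new_fingers`, `pred_int`, and `conf_int` are illustrative.

```r
library(HistData)

fingmod <- lm(height ~ finger, data = MacdonellDF)

# Hypothetical new finger lengths (mm), all within the observed range
new_fingers <- data.frame(finger = c(10.5, 11.85, 13))

# Interval for an individual new observation
pred_int <- predict(fingmod, new_fingers, interval = "prediction", level = 0.95)

# Interval for the mean height at each finger length
conf_int <- predict(fingmod, new_fingers, interval = "confidence", level = 0.95)

cbind(new_fingers, pred_int)

# Compare the interval widths
pred_int[, "upr"] - pred_int[, "lwr"]
conf_int[, "upr"] - conf_int[, "lwr"]
```

The prediction interval is wider because it accounts for the scatter of individual heights around the line as well as the uncertainty in the line itself.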