Simple Linear Regression

Getting Started

We are going to look at how to construct a linear model for a set of data. I will include the R syntax and explain the statistical meaning of the tests and results. First we need data to work with. We will use the data set MacdonellDF from the package HistData. To install the package in RStudio, go to Tools > Install Packages and enter HistData. Then load the package and the data:

library(HistData)
attach(MacdonellDF)
fing <- MacdonellDF

This loads the package and attaches the data set, so its columns can be referred to by name. I also gave the data set a shorter name, fing, that is easier to type. Now let's see what the data looks like.
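As an alternative to `attach()`, functions can be pointed at a data frame directly. The sketch below uses a small stand-in data frame, `toy` (a hypothetical name), built from the six rows that `head(fing)` prints, so it runs without HistData. The same pattern should apply to `fing`, though the numbers from six rows will of course differ from the full fit.

```r
# Stand-in data: the six rows that head(fing) prints (not the full data set)
toy <- data.frame(
  height = c(4.630208, 4.713542, 4.796875, 4.796875, 4.796875, 4.796875),
  finger = c(10.0, 10.3, 9.9, 10.2, 10.2, 10.3)
)

# with() evaluates an expression using the data frame's columns,
# without adding anything to the search path
with(toy, mean(height))

# Modelling functions such as lm() accept the data frame via `data =`
toymod <- lm(height ~ finger, data = toy)
coef(toymod)
```

Passing `data =` keeps it explicit which data set each variable comes from, which can matter once more than one data frame is loaded.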

names(fing)
## [1] "height" "finger"

This returns the names of the variables (columns) in the data set.

head(fing)
##     height finger
## 1 4.630208   10.0
## 2 4.713542   10.3
## 3 4.796875    9.9
## 4 4.796875   10.2
## 5 4.796875   10.2
## 6 4.796875   10.3

This shows the first six rows of the data set to give us a general idea of the values.

summary(fing)
##      height          finger     
##  Min.   :4.630   Min.   : 9.50  
##  1st Qu.:5.297   1st Qu.:11.20  
##  Median :5.380   Median :11.50  
##  Mean   :5.420   Mean   :11.55  
##  3rd Qu.:5.547   3rd Qu.:11.90  
##  Max.   :6.380   Max.   :13.50

This command gives us the mean and five-number summary for each of the two variables.

Now let's visualize the data to see whether there might be a linear relationship between the two variables. We'll use height as the dependent variable. To make the graph easier to read, we will also give the axes labels.

plot(finger, height, ylab = "Height(Feet)", xlab = "Finger(mm)")

The plot shows there could be a linear relationship but we would need to run additional tests to find out what the relationship is and how to quantify it. To do this we will use linear regression. We will name the model “fingmod”.

Linear Regression

fingmod <- lm(height~finger)
summary(fingmod)
## 
## Call:
## lm(formula = height ~ finger)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54590 -0.10376 -0.00188  0.10692  1.04906 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.478423   0.061922   40.02   <2e-16 ***
## finger      0.254708   0.005356   47.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.161 on 2998 degrees of freedom
## Multiple R-squared:   0.43,  Adjusted R-squared:  0.4298 
## F-statistic:  2261 on 1 and 2998 DF,  p-value: < 2.2e-16

This shows us that for every 1 mm increase in finger length we can expect height to increase by 0.2547 feet, on average. The intercept should not be interpreted, because a finger cannot naturally have length 0. The small p-value for finger means that when we test the null hypothesis that the coefficient of finger is 0, we reject the null; in other words, finger length is significant in predicting height. The R-squared of 0.43 tells us that finger length accounts for about 43% of the variation in height. We should add the fitted line to the scatter plot to see this relationship.
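The estimates in a coefficient table like this one can be reproduced from basic summary statistics, which may help make the slope's meaning concrete. The sketch below uses simulated stand-in data (the vectors `x` and `y`, the seed, and the generating values are assumptions chosen to loosely resemble the output above, not the Macdonell data). It checks that the least-squares slope is cov(x, y) / var(x), that the intercept is mean(y) - slope * mean(x), and that R-squared is the squared correlation.

```r
set.seed(1)                                   # arbitrary seed
n <- 200
x <- rnorm(n, mean = 11.5, sd = 0.55)         # stand-in finger lengths
y <- 2.48 + 0.255 * x + rnorm(n, sd = 0.16)   # stand-in heights

mod <- lm(y ~ x)

# Closed-form least-squares estimates for one predictor
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# Both comparisons should return TRUE
all.equal(unname(coef(mod)), c(intercept, slope))
all.equal(summary(mod)$r.squared, cor(x, y)^2)

# 95% confidence intervals for both coefficients
confint(mod)
```

`confint()` gives a range of plausible values for each coefficient, which complements the p-values in the summary table.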

plot(finger, height, ylab = "Height(Feet)", xlab = "Finger(mm)")
abline(coefficients(fingmod))

The fitted line follows the trend in the points, which supports a linear relationship between the two variables. In other words, the height of a person can be predicted from the length of their left hand middle finger. We can see this both in the p-values and in the graph. Next we should check that our model is a good fit. We will do this by looking at the residuals.

fingres <- fingmod$residuals

This gives us a residual for every data point, so now we can look at them as a whole. We can use a Q-Q plot to check the residuals for normality.

qqnorm(fingres)
qqline(fingres)

The points all lie close to the line except for a few possible outliers at the upper end. Given the large number of data points, I think the Q-Q plot looks good. Next we should check that the variance is the same for all values of finger length. We can do this by plotting the residuals against finger length.

plot(fingres~finger)
abline(0,0)

The variance of the residuals appears to be independent of finger length.
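A rough numerical companion to the residual plot is to compare the spread of the residuals across bins of the predictor. This is an informal check, not a formal test. The sketch uses simulated stand-in data with constant error variance (all names and generating values here are assumptions, not the Macdonell data), so the four standard deviations should come out similar.

```r
set.seed(2)                                   # arbitrary seed
n <- 400
x <- rnorm(n, mean = 11.5, sd = 0.55)         # stand-in finger lengths
y <- 2.48 + 0.255 * x + rnorm(n, sd = 0.16)   # stand-in heights

res <- residuals(lm(y ~ x))

# Split the predictor into four bins at its quartiles
bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)

# Standard deviation of the residuals within each bin
spread <- tapply(res, bins, sd)
spread
```

If one bin showed a much larger standard deviation than the others, that would be a hint that the variance changes with the predictor.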

Prediction

Let's predict a height given a finger length. First we need a finger length to predict from. In the brackets, the first argument is the row and the second argument is the column that the value is taken from.

fing[1248, 2]
## [1] 11.4

This returns the finger value from the 1248th data entry. We can then use our linear regression equation to predict the height. We will use the coefficients function to quickly retrieve the coefficients.

coefficients(fingmod)
## (Intercept)      finger 
##   2.4784231   0.2547076

We can plug our finger value of 11.4 into the equation.

2.4784231 + 11.4*.2547076
## [1] 5.38209

This returns a predicted height of 5.382 feet. Now let's check this against the actual height; recall that we used the finger length from the 1248th row.

fing[1248,1]
## [1] 5.380208

This returns the actual height, but we can subtract the prediction from it in one step to get the difference.

fing[1248,1]-5.38209
## [1] -0.001881667

This shows us that our prediction was 0.00188 feet too tall, so it was a good prediction for this data point.
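The hand calculation can be wrapped in a small function so the coefficients are typed only once. `predict_height` is a hypothetical helper name; the two coefficients and the observed height are copied from the output above.

```r
# Coefficients as printed by coefficients(fingmod)
b0 <- 2.4784231
b1 <- 0.2547076

# Hypothetical helper: predicted height (feet) for a given finger length
predict_height <- function(finger) b0 + b1 * finger

pred <- predict_height(11.4)
pred                     # about 5.38209

# Residual = observed - predicted, using the printed height for row 1248
5.380208 - pred
```

Because R arithmetic is vectorised, `predict_height(c(10, 11.4, 12))` returns three predictions at once.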

We can also predict the height for a new data point. That is, given a new finger length we can predict the height.

newdata <- data.frame(finger = 11.85)
(predh <- predict(fingmod, newdata, interval = "prediction"))
##        fit      lwr      upr
## 1 5.496708 5.181045 5.812371

This gives us a point estimate of 5.497 feet and a 95% prediction interval of 5.181 to 5.812 feet.
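For reference, a prediction interval from `predict()` can be rebuilt from the textbook formula, fit ± t * s * sqrt(1 + 1/n + (x0 - mean(x))^2 / Sxx), where s is the residual standard error and Sxx is the sum of squared deviations of the predictor. The sketch below checks this on simulated stand-in data (names, seed, and generating values are assumptions, not the Macdonell data).

```r
set.seed(3)                                   # arbitrary seed
n <- 200
x <- rnorm(n, mean = 11.5, sd = 0.55)         # stand-in finger lengths
y <- 2.48 + 0.255 * x + rnorm(n, sd = 0.16)   # stand-in heights
mod <- lm(y ~ x)

x0 <- 11.85                                   # new predictor value
auto <- predict(mod, data.frame(x = x0), interval = "prediction")

s    <- summary(mod)$sigma                    # residual standard error
sxx  <- sum((x - mean(x))^2)
fit  <- unname(coef(mod)[1] + coef(mod)[2] * x0)
half <- qt(0.975, df = n - 2) * s * sqrt(1 + 1/n + (x0 - mean(x))^2 / sxx)

manual <- c(fit = fit, lwr = fit - half, upr = fit + half)

# Should return TRUE: the manual interval matches predict()
all.equal(unname(auto[1, ]), unname(manual))
```

The `1 +` under the square root accounts for the scatter of an individual around the line. Dropping it gives the narrower confidence interval for the mean height at that finger length, which `predict()` returns with `interval = "confidence"`.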