Simple Linear Regression

RPub #2 in a series on data mining with R by

Karen Mazidi

This RPub is the second in my series on data mining here at http://rpubs.com/kjmazidi

Further information can be found on my blog: http://www.karenmazidi.com/blog.html

In the first post we downloaded the brain head data set and added column headings. Before moving on to linear regression, let's reload everything.

Loading the data

# get the data if it is not already in the data folder
data_file = "data/brainhead.dat"
if (!file.exists(data_file)) {
  dir.create(dirname(data_file), showWarnings = FALSE)
  download.file("http://www.stat.ufl.edu/~winner/data/brainhead.dat", destfile=data_file)
}
# load the data into R
brain = read.table(data_file, header=FALSE)
# add column headings
colnames(brain) <- c("Gender", "Age", "Head", "Brain")
# make sure Gender and Age are treated as qualitative data (factors)
brain$Gender <- as.factor(brain$Gender)
brain$Age <- as.factor(brain$Age)
# attach so the columns can be referenced by name
attach(brain)
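To double-check that the conversion worked, a quick look at the structure of the data frame (just a sanity-check sketch) should show Gender and Age as factors:

# sanity check: Gender and Age should now be factors
str(brain)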

Looking at the data

Let's take a look at the data in two ways. First, a peek at the first few rows, then a scatterplot of head size against brain weight for all observations. Then we'll quantify the correlation that appears in the plot. The strong correlation we see between head size and brain weight suggests that linear regression would be a good technique for modeling this relationship.

# take a peek at the data
head(brain)
##   Gender Age Head Brain
## 1      1   1 4512  1530
## 2      1   1 3738  1297
## 3      1   1 4261  1335
## 4      1   1 3777  1282
## 5      1   1 4177  1590
## 6      1   1 3585  1300
# plot
plot(Brain~Head, brain, xlab="Head Size in cm^3", ylab="Brain Weight in grams", main="Plot of Brain Weight as a Function of Head Size")

(Figure: scatterplot of Brain Weight in grams vs. Head Size in cm^3)

# correlation
cor(Head, Brain)
## [1] 0.7995697

Linear Regression

It is fitting that we are using a quaint data set from the early 1900s, since the techniques of linear regression were developed around the same period. Although linear regression has been around for so long, it is still widely used and forms the basis of more complicated techniques we will explore in later posts.

Linear regression models the relationship between a dependent variable y, sometimes called the response variable, and independent variable(s) x, called predictor(s). If there is only one x, the technique is called simple linear regression. If there is more than one predictor variable x, the technique is called multiple linear regression. For an in-depth discussion of the technique, I recommend the following book: http://www-bcf.usc.edu/~gareth/ISL/

Linear regression falls into the category of supervised learning because we know the values of y for our training data. We hope to learn a model so that for future, unseen values of x, we can predict values of y. This linear function is written as:

\( Y = \beta_0 + \beta_1 X + \epsilon \)

Where \( \beta_0 \) is the intercept, the expected value of \( Y \) when \( X = 0 \); \( \beta_1 \) is the slope, the average change in Y associated with a one-unit change in X; and \( \epsilon \) is an error term. We need an error term because, as George Box famously said, all models are wrong, but some are useful.

The coefficients \( \beta_0 \) and \( \beta_1 \) are unknown; they are estimated from the training data using the least squares approach, which seeks to minimize the residual sum of squares (RSS). You can visualize the residuals as lines drawn from the points to the regression line. The shorter these lines are, the better our line of fit.
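To make the least squares idea concrete, here is a small sketch (not part of the original post) that computes the estimates directly from the textbook formulas \( \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \) and \( \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \); the results should agree with what lm() reports below.

# least squares estimates computed by hand (assumes brain is loaded as above)
x <- brain$Head; y <- brain$Brain
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(intercept = b0, slope = b1)   # should match coef(lm(Brain ~ Head, data = brain))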

Linear regression in R

The lm() function is used to fit a linear model. As with any R function, you can use help(lm) (or ?lm) to get detailed information about it. There are several possible arguments to the function. The first argument is the formula. In the example below we use the formula Brain \( \sim \) Head, which means that the response Brain is modeled as a function of the variable Head. The data=brain argument instructs R to use the brain data frame.

When we display model1, R gives us the coefficients. So our linear model estimate is:

\( \hat{Y} = 325.5734 + 0.2634 X \)

model1 = lm(Brain~Head, data=brain)
model1
## 
## Call:
## lm(formula = Brain ~ Head, data = brain)
## 
## Coefficients:
## (Intercept)         Head  
##    325.5734       0.2634
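As a quick sketch of how to work with the fitted model object, coef() extracts the estimates so you can compute a fitted value by hand; the head size of 4000 cm^3 below is just an arbitrary illustration, not from the original post.

b <- coef(model1)            # named vector: (Intercept), Head
unname(b[1] + b[2] * 4000)   # predicted brain weight (g) for a 4000 cm^3 head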

Let's see what that looks like.

plot(Brain~Head, brain, xlab="Head Size in cm^3", ylab="Brain Weight in grams", main="Plot of Brain Weight as a Function of Head Size")
# draw the model line
abline(model1, col="blue")

(Figure: the same scatterplot with the fitted regression line in blue)

Model summary

The model line in the graph above looks pretty good, but how good is the model? If we run summary() on our model, we can get information about the goodness of fit.

Looking at the output of summary() below, we see that the median of the residuals is -1.76, not bad, and the max and min represent quite a range. Hmmm.

The next section describes the coefficients. We are not that interested in the intercept but in the coefficients of our variable(s), because we want to know how well these variables can predict the response. R displays the estimate, its standard error, the t value, and the probability of getting a t-statistic at least that large. The t-statistic measures how many standard errors our coefficient estimate is from 0, and a coefficient of 0 of course means that there is no relationship between the response and the predictor. Generally, if our standard error is large, we will need a coefficient with a large absolute value in order to conclude that it is significant. The Pr(>|t|) value is the probability of observing a value at least as extreme as |t| under the null hypothesis that the true coefficient is 0. A small value indicates that it is unlikely we would observe such a strong association between the predictor and the response by chance. Typical cut-off values that enable us to reject the null hypothesis are 5% or 1%.

The coefficient statistics are followed by significance codes. The three asterisks indicate that the coefficient is significant at the 0.001 level.

Finally, R provides the RSE, R-squared, and F-statistic. The RSE (residual standard error) is a measure of the lack of fit of the model to the data; it is roughly the average amount that the response deviates from the true regression line. The \( R^2 \) statistic is also a measure of fit. RSE is measured in the units of Y, so it is hard to say in the abstract what a good RSE is. In contrast, the \( R^2 \) statistic takes on values between 0 and 1 and is independent of the scale of Y. \( R^2 \) measures the proportion of variance in the response explained by the model. A number near 0 means that the regression did not explain much of the variance in the response. What counts as a “good” \( R^2 \) statistic may depend upon the application. For simple linear regression with a single predictor, \( R^2 \) is simply the square of r, the Pearson correlation coefficient.

The F-statistic is the ratio of the average variability in the data that is explained by the model to the average variability that is left unexplained. In other words, it measures how much the model improves the prediction of the response relative to the inaccuracy that remains in the model. A good model should have an F-statistic well above 1. The p-value for the F-statistic here is very low, giving further indication that we have a good model. Note that a p-value of 0.05 does NOT mean that there is a 95% chance that a given hypothesis is correct. The p-value is the probability of obtaining an effect at least as extreme as the one in the sample data, assuming the truth of the null hypothesis. So if the p-value is small, we reject the null hypothesis. The p-value is just one of many statistics to evaluate.

summary(model1)
## 
## Call:
## lm(formula = Brain ~ Head, data = brain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -175.98  -49.76   -1.76   46.60  242.34 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 325.57342   47.14085   6.906 4.61e-11 ***
## Head          0.26343    0.01291  20.409  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 72.43 on 235 degrees of freedom
## Multiple R-squared:  0.6393, Adjusted R-squared:  0.6378 
## F-statistic: 416.5 on 1 and 235 DF,  p-value: < 2.2e-16
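To connect the summary output to the definitions above, here is a small sketch (not in the original post) that reproduces a few of these numbers by hand: the squared Pearson correlation should match Multiple R-squared, the t value for Head is its estimate divided by its standard error, and the RSE is the square root of the RSS divided by the residual degrees of freedom.

# check R-squared against the squared Pearson correlation
cor(brain$Head, brain$Brain)^2                        # about 0.6393
# t value for Head: estimate / standard error
0.26343 / 0.01291                                     # about 20.4
# residual standard error: sqrt(RSS / (n - 2))
sqrt(sum(residuals(model1)^2) / df.residual(model1))  # about 72.43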

A warning about p-values

In March 2016 the American Statistical Association released a set of principles relating to p-values, out of their concern that too many papers were being published based primarily on the fact that a p-value < 0.05 was achieved. Here is a link to their paper: http://dx.doi.org/10.1080/00031305.2016.1154108

The paper listed 6 principles regarding the use and misuse of p-values:

  1. p-values can indicate how incompatible the data are with a specific statistical model

  2. p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone

  3. scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold

  4. proper inference requires full reporting and transparency

  5. a p-value, or statistical significance, does not measure the size of an effect or the importance of a result

  6. by itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis

Confidence intervals

R uses the function confint() to calculate confidence intervals.

A confidence interval is a range of values, computed from the data, that is believed to contain the true value of a parameter with a certain degree of confidence.

For example, the output below shows a 95% confidence interval for \( \beta_1 \) (Head) from 0.2380003 to 0.2888584. Loosely speaking, this means we can be about 95% confident that the true \( \beta_1 \) lies in that interval; more precisely, intervals constructed this way will cover the true value about 95% of the time.

confint(model1)
##                   2.5 %      97.5 %
## (Intercept) 232.7007553 418.4460868
## Head          0.2380003   0.2888584
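As a sketch of where these numbers come from, the 95% interval for the Head coefficient is just the estimate plus or minus the appropriate t quantile times its standard error; confint() can also produce intervals at other confidence levels.

# reconstruct the 95% interval for the Head coefficient by hand
0.26343 + c(-1, 1) * qt(0.975, df = 235) * 0.01291
# ask for a wider 99% interval for the same coefficient
confint(model1, "Head", level = 0.99)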

Prediction

One of the nice things about linear regression is that it builds a model we can use to predict response values for unseen predictor values. The following example uses the predict() function in R for this purpose. The mean Head size was 3633.992. Let's take some values around that mean and predict brain weight. The data.frame(Head=c(3500, 3600, 3700)) argument builds a data frame on the fly as input. We have also told R that we want confidence intervals around the predictions. The output shows the predictions in the fit column with the lower and upper bounds of the confidence interval.

predict(model1, data.frame(Head=c(3500, 3600, 3700)), interval="confidence")
##        fit      lwr      upr
## 1 1247.576 1237.701 1257.451
## 2 1273.919 1264.610 1283.228
## 3 1300.262 1290.843 1309.681
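The intervals above are confidence intervals for the average brain weight at each head size. If you want an interval for an individual new observation instead, predict() also accepts interval="prediction"; these intervals are wider because they also account for the irreducible error \( \epsilon \). A minimal sketch:

# prediction intervals for individual (not average) responses
predict(model1, data.frame(Head=c(3500, 3600, 3700)), interval="prediction")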

Looking ahead …

So far we have just been looking at one quantitative predictor. In the next RPub on Multiple Regression, we will look at how the Age and Gender variables could be incorporated into our linear model.