I am working through the material in the Penn State online class STAT 501 Regression Methods. This R script is a personal exercise to translate the concepts and examples illustrated in Minitab into R and SAS. Lesson 1 in STAT 501 is Simple Linear Regression. The SAS version of the code below is posted on my blog.
This lesson primarily introduces the simple linear regression (SLR) model. The four conditions for inferences from the SLR model are that the errors \(e_i\) are 1) independent 2) normally distributed random variables with 3) mean zero (linearity) and 4) constant variance \(\sigma^2\).
This lesson introduces ANOVA. The coefficient of determination, \(r^2\) (lower case r when referring to SLR), is the ratio of the regression sum of squares to the total sum of squares, \(r^2=SSR/SST\). The Pearson correlation coefficient r is simply the square root of \(r^2\) with sign equal to that of the slope parameter \(\beta_1\).
This lesson also explains the hypothesis test of the population correlation \(\rho\), used when it is not obvious which variable should be regarded as the response; the test is equivalent to the t-test of \(H_0: \beta_1=0\) and to the ANOVA F-test of the same hypothesis.
A few notes on translating to R: fixed-width data files are read with the read_fwf function from the readr package, and tab-delimited files with the read.table function. Models are fit with the lm function, whose summary reports the coefficient of determination \(r^2\), but not the Pearson correlation coefficient \(r\). You can either take the square root, or calculate it directly with the cor function. Even better, use the cor.test function to get a confidence interval along with \(r\).
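As a minimal sketch of those three routes to \(r\), assuming a hypothetical data frame d with numeric columns y and x:
fit <- lm(y ~ x, data = d)
# r from r^2, restoring the sign of the slope:
sign(coef(fit)[["x"]]) * sqrt(summary(fit)$r.squared)
cor(d$y, d$x)       # r computed directly
cor.test(d$y, d$x)  # r with a confidence interval and significance test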
This lesson uses several data files. I’ll load them all in this section so they don’t clutter the code later, and because the lessons often refer to the same data files multiple times.
library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# skincancer is a fixed-width data file.
# Mortality due to skin cancer (number of deaths per 10 million people), and
# U.S. state latitude and longitude at center of state for n=49 states.
skincancer <- read_fwf(
file="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/skincancer.txt",
skip = 1,
fwf_widths(c(25, 9, 7, 5, 5), c('state','lat','mort','ocean','lon')))
## Parsed with column specification:
## cols(
## state = col_character(),
## lat = col_double(),
## mort = col_integer(),
## ocean = col_integer(),
## lon = col_double()
## )
#head(skincancer)
#summarise(skincancer, n=n())
# student_height_weight is a tab-delimited data file.
# Height (inches) and weight (pounds) for n=10 students.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/student_height_weight.txt'
student <- read.table(url, sep = '\t', header = TRUE)
#head(student)
#summarise(student, n=n())
# bldgstories is a tab-delimited data file.
# Building height (feet) and number of stories for n=60 buildings.
# In this case, the header includes a space character that confuses R. Work
# around the issue by skipping the first line, then manually setting the
# column headers using colnames.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/bldgstories.txt'
bldgstories <- read.table(url, sep = '\t', skip = 1)
colnames(bldgstories) <- c('YEAR', 'HGHT', 'STORIES')
#head(bldgstories)
#summarise(bldgstories, n=n())
# signdist is a tab-delimited data file.
# Driver age (years) and maximum distance (feet) at which the driver can read
# a road sign for n=30 drivers.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/examples/signdist.txt'
signdist <- read.table(url, sep = '\t', header = TRUE)
#head(signdist)
#summarise(signdist, n=n())
# heightgpa is a tab-delimited data file.
# Student height (inches) and academic performance (gpa) for n=35 students.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/heightgpa.txt'
heightgpa <- read.table(url, sep = '\t', header = TRUE)
#head(heightgpa)
#summarise(heightgpa, n=n())
# husbandwife is a tab-delimited data file.
# Husband and wife age and height for n=218 couples.
# A "*" in the data represents a missing value. Passing na.strings = "*" to
# read.table converts those entries to NA so the columns are read as numeric.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/husbandwife.txt'
husbandwife <- read.table(url, sep = '\t', header = TRUE, na.strings = '*')
#head(husbandwife)
summarise(husbandwife, n=n())
## n
## 1 218
# poverty is a tab-delimited data file.
# Poverty rate, teen birth rate for n=51 U.S. states (including D.C.).
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/examples/poverty.txt'
poverty <- read.table(url, sep = '\t', header = TRUE)
head(poverty)
## Location PovPct Brth15to17 Brth18to19 ViolCrime TeenBrth
## 1 Alabama 20.1 31.5 88.7 11.2 54.5
## 2 Alaska 7.1 18.9 73.7 9.1 39.5
## 3 Arizona 16.1 35.0 102.5 10.4 61.2
## 4 Arkansas 14.9 31.6 101.7 10.4 59.9
## 5 California 16.7 22.6 69.1 11.2 41.1
## 6 Colorado 8.8 26.2 79.1 5.8 47.0
summarise(poverty, n=n())
## n
## 1 51
Simple linear regression (SLR) summarizes the relationship between two quantitative variables.
Here is an example of a statistical relationship. Data set skincancer
records one observation per U.S. state (n=49). The response variable mort
is the mortality due to skin cancer (number of deaths per 10 million people) and the explanatory variable lat
is the latitude (degrees North) at the center of each of 49 states in the U.S.
ggplot(skincancer, aes(y=mort, x=lat)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("Mortality (Deaths per 10 million)") +
xlab("Latitude (at center of state)") +
ggtitle("Skin Cancer Mortality versus State Latitude")
The difference between the actual value and the predicted value is the residual \(e_i=y_i-\hat{y_i}\). Ordinary Least Squares (OLS) minimizes the sum of squared residuals. The resulting least squares line passes through the point \((\bar{x},\bar{y})\).
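A quick check of that last fact using the skincancer data loaded above: the fitted value at the mean latitude equals the mean mortality.
fit <- lm(mort ~ lat, data = skincancer)
# Predicted mortality at the mean latitude...
predict(fit, newdata = data.frame(lat = mean(skincancer$lat)))
# ...equals the mean mortality.
mean(skincancer$mort)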
In the skincancer example, the fitted model is mort ~ lat. The parameter estimate for lat is -5.9776, meaning one additional degree of northerly latitude is associated with 5.9776 fewer deaths per 10 million of population.
summary(lm(mort ~ lat, data = skincancer))
##
## Call:
## lm(formula = mort ~ lat, data = skincancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.972 -13.185 0.972 12.006 43.938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 389.1894 23.8123 16.34 < 2e-16 ***
## lat -5.9776 0.5984 -9.99 3.31e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.12 on 47 degrees of freedom
## Multiple R-squared: 0.6798, Adjusted R-squared: 0.673
## F-statistic: 99.8 on 1 and 47 DF, p-value: 3.309e-13
Consider another example, the relationship between weight and height. Data set student_height_weight records one observation per student (n=10). The response variable wt is the student weight (pounds) and the explanatory variable ht is the student height (inches).
ggplot(student, aes(y=wt, x=ht)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("Weight (lbs)") +
xlab("Height (inches)") +
ggtitle("Weight versus Height")
In this example, the fitted model is wt ~ ht. The parameter estimate for ht is 6.1376, meaning each additional inch of height is associated with 6.14 additional pounds of weight among the population of students from which the sample was drawn.
summary(lm(wt ~ ht, data = student))
##
## Call:
## lm(formula = wt ~ ht, data = student)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2339 -4.0804 -0.0963 4.6445 14.2158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -266.5344 51.0320 -5.223 8e-04 ***
## ht 6.1376 0.7353 8.347 3.21e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.641 on 8 degrees of freedom
## Multiple R-squared: 0.897, Adjusted R-squared: 0.8841
## F-statistic: 69.67 on 1 and 8 DF, p-value: 3.214e-05
Inferences from the simple linear regression model are valid when the following conditions hold: the errors \(e_i\) are 1) independent, 2) normally distributed random variables with 3) mean zero (linearity) and 4) constant variance \(\sigma^2\). Mean zero (linearity) means the mean of the errors at each value of the explanatory variable is zero.
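As a sketch of checking these conditions, base R’s plot method for lm objects produces the standard diagnostic plots:
fit <- lm(wt ~ ht, data = student)
plot(fit, which = 1)  # residuals vs fitted values: linearity and constant variance
plot(fit, which = 2)  # normal Q-Q plot: normality of the errors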
The common error variance \(\sigma^2\) is estimated by the mean squared error \(MSE=\sum(y_i-\hat{y_i})^2 / (n-2)\), where the numerator is the sum of squared errors (SSE). The denominator n-2 is the degrees of freedom: n observations minus the 2 estimated parameters (intercept and slope).
In the student_height_weight
example above, MSE=\(8.641^2\)=74.67.
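The same number can be computed directly from the residuals; a quick sketch using the student data:
fit <- lm(wt ~ ht, data = student)
sse <- sum(residuals(fit)^2)   # sum of squared errors
mse <- sse / df.residual(fit)  # df.residual(fit) is n - 2 here
mse                            # matches summary(fit)$sigma^2 = 8.641^2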
SSE is one of three related sums of squares. The other two are the regression sum of squares \(SSR=\sum(\hat{y}_i-\bar{y})^2\) and the total sum of squares \(SST=\sum(y_i-\bar{y})^2\). The coefficient of determination, \(r^2\), is the ratio \(SSR/SST\): the fraction of the variation in the response variable explained by the explanatory variable. Note that if \(r^2=1\) then the explanatory variable accounts for all of the variation in the response variable, and if \(r^2=0\) then the explanatory variable accounts for none of it - a horizontal regression line.
In the skincancer example above, \(r^2=.6798\). We can say that 68% of the variation in skin cancer mortality is reduced by taking latitude into account; equivalently, 68% of the variation in skin cancer mortality is ‘explained by’ latitude.
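The decomposition can be verified directly from the fit; a quick sketch using the skincancer model:
fit <- lm(mort ~ lat, data = skincancer)
sse <- sum(residuals(fit)^2)                            # error sum of squares
sst <- sum((skincancer$mort - mean(skincancer$mort))^2) # total sum of squares
ssr <- sst - sse                                        # regression sum of squares
ssr / sst   # r^2, matching the 0.6798 reported by summary()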
The Pearson correlation coefficient r is simply the square root of \(r^2\) with sign equal to that of the slope parameter \(\beta_1\). r takes values between -1 (perfect negative linear relationship) and +1 (perfect positive linear relationship). r=0 indicates no linear relationship.
The Pearson correlation coefficient in the skincancer example is \(-\sqrt{.6798}=-.825\) (negative because the slope is negative). The Pearson correlation coefficient in the student_height_weight example is \(\sqrt{.897}=.947\).
cor(skincancer$mort, skincancer$lat)
## [1] -0.8245178
cor(student$wt, student$ht)
## [1] 0.9470984
Consider another example, the relationship between building height and the number of stories. Data set bldgstories
records one observation per building (n=60). The response variable HGHT
is the building height (feet) and the explanatory variable STORIES
is the number of floors in the building.
The Pearson correlation coefficient r=.951 indicates a strong positive linear relationship.
ggplot(bldgstories, aes(y=HGHT, x=STORIES)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("Height") +
xlab("Stories") +
ggtitle("Height versus Stories")
cor(bldgstories$HGHT, bldgstories$STORIES)
## [1] 0.9505549
Consider another example, the relationship between driver age and the distance at which the driver can read a road sign. Data set signdist records one observation per driver (n=30). The response variable Distance is the maximum distance (feet) at which the driver can still read a road sign and the explanatory variable Age is the driver’s age (years).
The Pearson correlation coefficient r=-.801 indicates a fairly strong negative linear relationship.
ggplot(signdist, aes(y=Distance, x=Age)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("Distance") +
xlab("Driver's Age") +
ggtitle("Distance versus Driver's Age")
cor(signdist$Distance, signdist$Age)
## [1] -0.8012447
Consider another example, the relationship between student height and academic performance (grade point average). Data set heightgpa
records one observation per student (n=35). The response variable gpa
measures academic performance and the explanatory variable height
is the student’s height (inches).
The Pearson correlation coefficient r=-0.053 indicates no linear relationship (as expected!).
ggplot(heightgpa, aes(y=gpa, x=height)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("Grade Point Averages") +
xlab("Height (inches)") +
ggtitle("Grade Point Average vs Height")
cor(heightgpa$gpa, heightgpa$height)
## [1] -0.05324126
The coefficient of determination \(r^2\) and the correlation coefficient \(r\) quantify the strength of a linear relationship. Both can be zero even when a perfect curvilinear relationship exists.
A large \(r^2\) does not mean the estimated regression line fits the data well. A nonlinear relationship might fit better.
\(r^2\) is sensitive to outliers and leverage points.
Correlation does not imply causation!
Ecological correlations, correlations based on rates or averages, tend to overstate the strength of an association.
A statistically significant \(r^2\) does not imply that the slope parameter is meaningfully different from 0.
A large \(r^2\) does not necessarily imply a useful prediction of the response \(y_{new}\), or estimation of the mean response \(µ_Y\), can be made. Prediction intervals or confidence intervals may still be too wide to be useful.
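For instance, predict produces both kinds of intervals (a quick sketch using the skincancer fit); the prediction interval for a new observation is always the wider of the two:
fit <- lm(mort ~ lat, data = skincancer)
new <- data.frame(lat = 40)
predict(fit, new, interval = "confidence")  # interval for the mean response
predict(fit, new, interval = "prediction")  # interval for a new observation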
When it is not obvious which variable should be regarded as the response, conduct the hypothesis test on the population correlation \(\rho\) instead. (The test is equivalent to the t-test of \(H_0: \beta_1=0\) and to the ANOVA F-test of the same hypothesis.)
Specifically, conduct a t-test of \(H_0: \rho=0\). The test statistic is \[t^*=r\sqrt{n-2}/\sqrt{1-r^2}.\]
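As a check, the test statistic can be computed by hand; a sketch using the husbandwife data loaded above, which reproduces the t value that cor.test reports in the example below:
ok <- complete.cases(husbandwife$WAge, husbandwife$HAge)
n <- sum(ok)                                   # couples with both ages recorded
r <- cor(husbandwife$WAge[ok], husbandwife$HAge[ok])
r * sqrt(n - 2) / sqrt(1 - r^2)                # t*, compare with cor.test below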
For example, the relationship between the husband’s age and wife’s age does not have a clear explanatory and response variable. Data set husbandwife records one observation per married couple (n=218). The response variable can be wife age WAge or husband age HAge, with the other serving as the explanatory variable. The Pearson correlation coefficient is .938 (the same regardless of the order of the arguments). Only n=170 couples have both ages recorded, so the t-test has n-2=168 degrees of freedom; \(t^*=34.975\) with \(p<.0001\), so reject \(H_0: \rho=0\).
# Stack the data both ways so ggplot draws a fitted line for each ordering.
w <- husbandwife$WAge
h <- husbandwife$HAge
df <- rbind(data.frame(age1=w, age2=h, grp="wife age vs husband age"),
data.frame(age1=h, age2=w, grp="husband age vs wife age"))
ggplot(na.omit(df), aes(y=age1, x=age2, group=grp, shape=grp, colour=grp)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("Age 1") +
xlab("Age 2") +
ggtitle("Spouse Age vs Spouse Age")
cor(husbandwife$HAge, husbandwife$WAge, use="pairwise.complete.obs")
## [1] 0.937681
cor(husbandwife$WAge, husbandwife$HAge, use="pairwise.complete.obs")
## [1] 0.937681
cor.test(husbandwife$WAge, husbandwife$HAge, alternative = "two.sided")
##
## Pearson's product-moment correlation
##
## data: husbandwife$WAge and husbandwife$HAge
## t = 34.975, df = 168, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9165191 0.9536083
## sample estimates:
## cor
## 0.937681
Consider another example, the relationship between teen birth rate, ages 15-17 (Brth15to17) and poverty rate (PovPct). Data set poverty
records the poverty rate PovPct
and tean birth rate for ages 15-17 Brth15to17
for each U.S. state (n=51).
ggplot(poverty, aes(y=Brth15to17, x=PovPct)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("Poverty Rate (%)") +
xlab("Teen Birth Rate (ages 15-17) (%)") +
ggtitle("Teen Birth (15-17) Rate versus Poverty Rate")
m <- lm(Brth15to17 ~ PovPct, data = poverty)
summary(m)
##
## Call:
## lm(formula = Brth15to17 ~ PovPct, data = poverty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2275 -3.6554 -0.0407 2.4972 10.5152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2673 2.5297 1.687 0.098 .
## PovPct 1.3733 0.1835 7.483 1.19e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.551 on 49 degrees of freedom
## Multiple R-squared: 0.5333, Adjusted R-squared: 0.5238
## F-statistic: 56 on 1 and 49 DF, p-value: 1.188e-09
The slope \((PovPct=1.3733)\) means the 15 to 17 year old birth rate increases by 1.3733 on average for each one percentage point increase in the poverty rate.
The intercept \((4.2673)\) is the predicted 15 to 17 year old birth rate for a poverty rate of 0%. Since no state has a 0% poverty rate, this interpretation is not meaningful (perhaps the explanatory variable should be centered).
The residual standard error (\(s=5.55057\)) roughly indicates the average difference between the predicted and observed values.
The r-squared (\(r^2=.5333\)) means the poverty rate “explains” 53.33% of the observed state-to-state variation in the 15 to 17 year old birth rate.
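To attach uncertainty to these interpretations, confint gives interval estimates for both coefficients of the fit above:
# 95% confidence intervals for the intercept and the PovPct slope.
confint(m, level = 0.95)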
Consider another example, the relationship between forced exhalation volume (FEV), a measure of how much air one can forcibly exhale, and age (in years).
# fev_dat is a fixed-width data file.
fev_dat <- read_fwf(
file="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/examples/fev_dat.txt",
skip = 1,
fwf_widths(c(4, 7, 7, 6, 6)))
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## X2 = col_double(),
## X3 = col_double(),
## X4 = col_integer(),
## X5 = col_integer()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 654 parsing failures.
## row 1, col X5: expected 6 chars, actual 1 (and similarly for the remaining rows)
## See problems(...) for more details.
colnames(fev_dat) <- c('age','FEV','ht','sex','smoke')
fev_dat <- fev_dat %>% filter(age>=6, age<=10)
head(fev_dat)
## # A tibble: 6 x 5
## age FEV ht sex smoke
## <int> <dbl> <dbl> <int> <int>
## 1 9 1.71 57.0 0 0
## 2 8 1.72 67.5 0 0
## 3 7 1.72 54.5 0 0
## 4 9 1.56 53.0 1 0
## 5 9 1.90 57.0 1 0
## 6 8 2.34 61.0 0 0
ggplot(fev_dat, aes(y=FEV, x=age)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("FEV") +
xlab("Age (years)") +
ggtitle("FEV versus Age")
m <- lm(FEV ~ age, data = fev_dat)
summary(m)
##
## Call:
## lm(formula = FEV ~ age, data = fev_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.22576 -0.28855 -0.00534 0.27106 1.90724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01165 0.15237 0.076 0.939
## age 0.26721 0.01801 14.839 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4312 on 349 degrees of freedom
## Multiple R-squared: 0.3869, Adjusted R-squared: 0.3851
## F-statistic: 220.2 on 1 and 349 DF, p-value: < 2.2e-16
An important feature of this example is that the variance of individual y-values from the regression line increases as age increases.
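A residuals-versus-age plot (a sketch reusing the fit above) makes the increasing spread easier to see:
# Plot residuals against age; the vertical spread widens with age.
ggplot(data.frame(age = m$model$age, resid = residuals(m)), aes(y=resid, x=age)) +
geom_point() +
geom_hline(yintercept = 0) +
ylab("Residual") +
xlab("Age (years)") +
ggtitle("Residuals versus Age")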