I am working through the material in the Penn State online class STAT 501 Regression Methods. This R script is a personal exercise to translate the concepts and examples illustrated in Minitab into R and SAS. Lesson 1 in STAT 501 is Simple Linear Regression. The SAS version of the code below is posted on my blog.
This lesson primarily introduces the simple linear regression (SLR) model. The four conditions for inferences from the SLR model are that the errors \(e_i\) are 1) independent 2) normally distributed random variables with 3) mean zero (linearity) and 4) constant variance \(\sigma^2\).
This lesson introduces ANOVA. The coefficient of determination, \(r^2\) (lower case r when referring to SLR), is the ratio of the regression sum of squares to the total sum of squares, \(r^2=SSR/SST\). The Pearson correlation coefficient r is simply the square root of \(r^2\) with sign equal to that of the slope parameter \(\beta_1\).
This lesson also explains the hypothesis test of the population correlation \(\rho\), used when it is not obvious which variable should be regarded as the response; the test is equivalent to the t-test of \(H_0: \beta_1=0\) and to the ANOVA F-test of the same hypothesis.
A few notes on translating to R: fixed-width data files are read with the read_fwf function from the readr package, and tab-delimited files with the read.table function. Models are fit with the lm function, whose summary reports the coefficient of determination \(r^2\), but not the Pearson correlation coefficient \(r\). You can either take the square root, or calculate it directly with the cor function. Even better, use the cor.test function to get a confidence interval along with \(r\).
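As a minimal sketch of those three routes to \(r\), assuming a hypothetical data frame d with numeric columns y and x:
fit <- lm(y ~ x, data = d)
# r from r^2, restoring the sign of the slope:
sign(coef(fit)[["x"]]) * sqrt(summary(fit)$r.squared)
cor(d$y, d$x)       # r computed directly
cor.test(d$y, d$x)  # r with a confidence interval and significance test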
This lesson uses several data files. I’ll load them all in this section so they don’t clutter the code later, and because the lessons often refer to the same data files multiple times.
library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# skincancer is a fixed-width data file.
# Mortality due to skin cancer (number of deaths per 10 million people), and
# U.S. state latitude and longitude at center of state for n=49 states.
skincancer <- read_fwf(
file="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/skincancer.txt",
skip = 1,
fwf_widths(c(25, 9, 7, 5, 5), c('state','lat','mort','ocean','lon')))
## Parsed with column specification:
## cols(
## state = col_character(),
## lat = col_double(),
## mort = col_integer(),
## ocean = col_integer(),
## lon = col_double()
## )
#head(skincancer)
#summarise(skincancer, n=n())
# student_height_weight is a tab-delimited data file.
# Height (inches) and weight (pounds) for n=10 students.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/student_height_weight.txt'
student <- read.table(url, sep = '\t', header = TRUE)
#head(student)
#summarise(student, n=n())
# bldgstories is a tab-delimited data file.
# Building height (feet) and number of stories for n=60 buildings.
# In this case, the header includes a space character that confuses R. Work
# around the issue by skipping the first line, then manually setting the
# column headers using colnames.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/bldgstories.txt'
bldgstories <- read.table(url, sep = '\t', skip = 1)
colnames(bldgstories) <- c('YEAR', 'HGHT', 'STORIES')
#head(bldgstories)
#summarise(bldgstories, n=n())
# signdist is a tab-delimited data file.
# Driver age (years) and maximum distance (feet) at which the driver can read
# a road sign for n=30 drivers.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/examples/signdist.txt'
signdist <- read.table(url, sep = '\t', header = TRUE)
#head(signdist)
#summarise(signdist, n=n())
# heightgpa is a tab-delimited data file.
# Student height (inches) and academic performance (gpa) for n=35 students.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/heightgpa.txt'
heightgpa <- read.table(url, sep = '\t', header = TRUE)
#head(heightgpa)
#summarise(heightgpa, n=n())
# husbandwife is a tab-delimited data file.
# Husband and wife age and height for n=218 couples.
# A "*" in the data represents a missing value. Passing na.strings = "*" to
# read.table converts those entries to NA so the columns are read as numeric.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/husbandwife.txt'
husbandwife <- read.table(url, sep = '\t', header = TRUE, na.strings = '*')
#head(husbandwife)
summarise(husbandwife, n=n())
## n
## 1 218
# poverty is a tab-delimited data file.
# Poverty rate, teen birth rate for n=51 U.S. states (including D.C.).
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/examples/poverty.txt'
poverty <- read.table(url, sep = '\t', header = TRUE)
head(poverty)
## Location PovPct Brth15to17 Brth18to19 ViolCrime TeenBrth
## 1 Alabama 20.1 31.5 88.7 11.2 54.5
## 2 Alaska 7.1 18.9 73.7 9.1 39.5
## 3 Arizona 16.1 35.0 102.5 10.4 61.2
## 4 Arkansas 14.9 31.6 101.7 10.4 59.9
## 5 California 16.7 22.6 69.1 11.2 41.1
## 6 Colorado 8.8 26.2 79.1 5.8 47.0
summarise(poverty, n=n())
## n
## 1 51
Simple linear regression (SLR) summarizes the relationship between two quantitative variables.
Here is an example of a statistical relationship. Data set skincancer
records one observation per U.S. state (n=49). The response variable mort
is the mortality due to skin cancer (number of deaths per 10 million people) and the explanatory variable lat
is the latitude (degrees North) at the center of each of 49 states in the U.S.
ggplot(skincancer, aes(y=mort, x=lat)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("Mortality (Deaths per 10 million)") +
xlab("Latitude (at center of state)") +
ggtitle("Skin Cancer Mortality versus State Latitude")
The difference between the actual value and the predicted value is the residual \(e_i=y_i-\hat{y_i}\). Ordinary Least Squares (OLS) minimizes the sum of squared residuals. The resulting least squares line passes through the point \((\bar{x},\bar{y})\).
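A quick check of that last fact using the skincancer data loaded above: the fitted value at the mean latitude equals the mean mortality.
fit <- lm(mort ~ lat, data = skincancer)
# Predicted mortality at the mean latitude...
predict(fit, newdata = data.frame(lat = mean(skincancer$lat)))
# ...equals the mean mortality.
mean(skincancer$mort)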
In the skincancer example, the fitted model is mort ~ lat. The parameter estimate for lat is -5.9776, meaning one additional degree of northerly latitude is associated with 5.9776 fewer deaths per 10 million of population.
summary(lm(mort ~ lat, data = skincancer))
##
## Call:
## lm(formula = mort ~ lat, data = skincancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.972 -13.185 0.972 12.006 43.938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 389.1894 23.8123 16.34 < 2e-16 ***
## lat -5.9776 0.5984 -9.99 3.31e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.12 on 47 degrees of freedom
## Multiple R-squared: 0.6798, Adjusted R-squared: 0.673
## F-statistic: 99.8 on 1 and 47 DF, p-value: 3.309e-13
Consider another example, the relationship between weight and height. Data set student_height_weight records one observation per student (n=10). The response variable wt is the student weight (pounds) and the explanatory variable ht is the student height (inches).
ggplot(student, aes(y=wt, x=ht)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("Weight (lbs)") +
xlab("Height (inches)") +
ggtitle("Weight versus Height")
In this example, the fitted model is wt ~ ht. The parameter estimate for ht is 6.1376, meaning each additional inch of height is associated with 6.14 additional pounds of weight among the population of students from which the sample was drawn.
summary(lm(wt ~ ht, data = student))
##
## Call:
## lm(formula = wt ~ ht, data = student)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2339 -4.0804 -0.0963 4.6445 14.2158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -266.5344 51.0320 -5.223 8e-04 ***
## ht 6.1376 0.7353 8.347 3.21e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.641 on 8 degrees of freedom
## Multiple R-squared: 0.897, Adjusted R-squared: 0.8841
## F-statistic: 69.67 on 1 and 8 DF, p-value: 3.214e-05
Inferences from the simple linear regression model are valid when the following conditions hold: the errors \(e_i\) are 1) independent, 2) normally distributed random variables with 3) mean zero (linearity) and 4) constant variance \(\sigma^2\). Mean zero (linearity) means the mean of the errors at each value of the explanatory variable is zero.
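As a sketch of checking these conditions, base R’s plot method for lm objects produces the standard diagnostic plots:
fit <- lm(wt ~ ht, data = student)
plot(fit, which = 1)  # residuals vs fitted values: linearity and constant variance
plot(fit, which = 2)  # normal Q-Q plot: normality of the errors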
The common error variance \(\sigma^2\) is estimated by the mean squared error \(MSE=\sum(y_i-\hat{y_i})^2 / (n-2)\), where the numerator is the sum of squared errors (SSE). The denominator n-2 is the degrees of freedom: n observations minus the 2 estimated parameters (intercept and slope).
In the student_height_weight
example above, MSE=\(8.641^2\)=74.67.
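The same number can be computed directly from the residuals; a quick sketch using the student data:
fit <- lm(wt ~ ht, data = student)
sse <- sum(residuals(fit)^2)   # sum of squared errors
mse <- sse / df.residual(fit)  # df.residual(fit) is n - 2 here
mse                            # matches summary(fit)$sigma^2 = 8.641^2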
SSE is one of three related sums of squares. The other two are the regression sum of squares \(SSR=\sum(\hat{y}_i-\bar{y})^2\) and the total sum of squares \(SST=\sum(y_i-\bar{y})^2\). The coefficient of determination, \(r^2\), is the ratio \(SSR/SST\): the fraction of the variation in the response variable explained by the explanatory variable. Note that if \(r^2=1\) then the explanatory variable accounts for all of the variation in the response variable, and if \(r^2=0\) then the explanatory variable accounts for none of it - a horizontal regression line.
In the skincancer example above, \(r^2=.6798\). We can say that 68% of the variation in skin cancer mortality is reduced by taking latitude into account; equivalently, 68% of the variation in skin cancer mortality is ‘explained by’ latitude.
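The decomposition can be verified directly from the fit; a quick sketch using the skincancer model:
fit <- lm(mort ~ lat, data = skincancer)
sse <- sum(residuals(fit)^2)                            # error sum of squares
sst <- sum((skincancer$mort - mean(skincancer$mort))^2) # total sum of squares
ssr <- sst - sse                                        # regression sum of squares
ssr / sst   # r^2, matching the 0.6798 reported by summary()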
The Pearson correlation coefficient r is simply the square root of \(r^2\) with sign equal to that of the slope parameter \(\beta_1\). r takes values between -1 (perfect negative linear relationship) and +1 (perfect positive linear relationship). r=0 indicates no linear relationship.
The Pearson correlation coefficient in the skincancer example is \(-\sqrt{.6798}=-.825\) (negative because the slope is negative). The Pearson correlation coefficient in the student_height_weight example is \(\sqrt{.897}=.947\).
cor(skincancer$mort, skincancer$lat)
## [1] -0.8245178
cor(student$wt, student$ht)
## [1] 0.9470984
Consider another example, the relationship between building height and the number of stories. Data set bldgstories
records one observation per building (n=60). The response variable HGHT
is the building height (feet) and the explanatory variable STORIES
is the number of floors in the building.
The Pearson correlation coefficient r=.951 indicates a strong positive linear relationship.
ggplot(bldgstories, aes(y=HGHT, x=STORIES)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("Height") +
xlab("Stories") +
ggtitle("Height versus Stories")
cor(bldgstories$HGHT, bldgstories$STORIES)
## [1] 0.9505549
Consider another example, the relationship between driver age and the distance at which the driver can read a road sign. Data set signdist records one observation per driver (n=30). The response variable Distance is the maximum distance (feet) at which the driver can still read a road sign and the explanatory variable Age is the driver’s age (years).
The Pearson correlation coefficient r=-.801 indicates a fairly strong negative linear relationship.
ggplot(signdist, aes(y=Distance, x=Age)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("Distance") +
xlab("Driver's Age") +
ggtitle("Distance versus Driver's Age")
cor(signdist$Distance, signdist$Age)
## [1] -0.8012447
Consider another example, the relationship between student height and academic performance (grade point average). Data set heightgpa
records one observation per student (n=35). The response variable gpa
measures academic performance and the explanatory variable height
is the student’s height (inches).
The Pearson correlation coefficient r=-0.053 indicates no linear relationship (as expected!).
ggplot(heightgpa, aes(y=gpa, x=height)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("Grade Point Averages") +
xlab("Height (inches)") +
ggtitle("Grade Point Average vs Height")
cor(heightgpa$gpa, heightgpa$height)
## [1] -0.05324126
The coefficient of determination \(r^2\) and the correlation coefficient \(r\) quantify the strength of a linear relationship. Both can be zero even when a perfect curvilinear relationship exists.
A large \(r^2\) does not mean the estimated regression line fits the data well. A nonlinear relationship might fit better.
\(r^2\) is sensitive to outliers and leverage points.
Correlation does not imply causation!
Ecological correlations, correlations based on rates or averages, tend to overstate the strength of an association.
A statistically significant \(r^2\) does not imply that the slope parameter is meaningfully different from 0.
A large \(r^2\) does not necessarily imply a useful prediction of the response \(y_{new}\), or estimation of the mean response \(µ_Y\), can be made. Prediction intervals or confidence intervals may still be too wide to be useful.
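For instance, predict produces both kinds of intervals (a quick sketch using the skincancer fit); the prediction interval for a new observation is always the wider of the two:
fit <- lm(mort ~ lat, data = skincancer)
new <- data.frame(lat = 40)
predict(fit, new, interval = "confidence")  # interval for the mean response
predict(fit, new, interval = "prediction")  # interval for a new observation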
When it is not obvious which variable should be regarded as the response, conduct the hypothesis test on the population correlation \(\rho\) instead. (The test is equivalent to the t-test of \(H_0: \beta_1=0\) and to the ANOVA F-test of the same hypothesis.)
Specifically, conduct a t-test of \(H_0: \rho=0\). The test statistic is \[t^*=r\sqrt{n-2}/\sqrt{1-r^2}.\]
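As a check, the test statistic can be computed by hand; a sketch using the husbandwife data loaded above, which reproduces the t value that cor.test reports in the example below:
ok <- complete.cases(husbandwife$WAge, husbandwife$HAge)
n <- sum(ok)                                   # couples with both ages recorded
r <- cor(husbandwife$WAge[ok], husbandwife$HAge[ok])
r * sqrt(n - 2) / sqrt(1 - r^2)                # t*, compare with cor.test below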
For example, the relationship between the husband’s age and wife’s age does not have a clear explanatory and response variable. Data set husbandwife records one observation per married couple (n=218). The response variable can be wife age WAge or husband age HAge, with the other serving as the explanatory variable. The Pearson correlation coefficient is .938 (the same regardless of the order of the arguments). Only n=170 couples have both ages recorded, so the t-test has n-2=168 degrees of freedom; \(t^*=34.975\) with \(p<.0001\), so reject \(H_0: \rho=0\).
# Stack the data both ways so ggplot draws a fitted line for each ordering.
w <- husbandwife$WAge
h <- husbandwife$HAge
df <- rbind(data.frame(age1=w, age2=h, grp="wife age vs husband age"),
data.frame(age1=h, age2=w, grp="husband age vs wife age"))
ggplot(na.omit(df), aes(y=age1, x=age2, group=grp, shape=grp, colour=grp)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("Age 1") +
xlab("Age 2") +
ggtitle("Spouse Age vs Spouse Age")
cor(husbandwife$HAge, husbandwife$WAge, use="pairwise.complete.obs")
## [1] 0.937681
cor(husbandwife$WAge, husbandwife$HAge, use="pairwise.complete.obs")
## [1] 0.937681
cor.test(husbandwife$WAge, husbandwife$HAge, alternative = "two.sided")
##
## Pearson's product-moment correlation
##
## data: husbandwife$WAge and husbandwife$HAge
## t = 34.975, df = 168, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9165191 0.9536083
## sample estimates:
## cor
## 0.937681
Consider another example, the relationship between teen birth rate, ages 15-17 (Brth15to17) and poverty rate (PovPct). Data set poverty
records the poverty rate PovPct
and tean birth rate for ages 15-17 Brth15to17
for each U.S. state (n=51).
ggplot(poverty, aes(y=Brth15to17, x=PovPct)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("Poverty Rate (%)") +
xlab("Teen Birth Rate (ages 15-17) (%)") +
ggtitle("Teen Birth (15-17) Rate versus Poverty Rate")
m <- lm(Brth15to17 ~ PovPct, data = poverty)
summary(m)
##
## Call:
## lm(formula = Brth15to17 ~ PovPct, data = poverty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2275 -3.6554 -0.0407 2.4972 10.5152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2673 2.5297 1.687 0.098 .
## PovPct 1.3733 0.1835 7.483 1.19e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.551 on 49 degrees of freedom
## Multiple R-squared: 0.5333, Adjusted R-squared: 0.5238
## F-statistic: 56 on 1 and 49 DF, p-value: 1.188e-09
The slope \((PovPct=1.3733)\) means the 15 to 17 year old birth rate increases by 1.3733 on average for each one percentage point increase in the poverty rate.
The intercept \((4.2673)\) is the predicted 15 to 17 year old birth rate for a poverty rate of 0%. Since no state has a 0% poverty rate, this interpretation is not meaningful (perhaps the explanatory variable should be centered).
The residual standard error (\(s=5.55057\)) roughly indicates the average difference between the predicted and observed values.
The r-squared (\(r^2=.5333\)) means the poverty rate “explains” 53.33% of the observed state-to-state variation in the 15 to 17 year old birth rate.
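To attach uncertainty to these interpretations, confint gives interval estimates for both coefficients of the fit above:
# 95% confidence intervals for the intercept and the PovPct slope.
confint(m, level = 0.95)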
Consider another example, the relationship between forced exhalation volume (FEV), a measure of how much air one can forcibly exhale, and age (in years).
# fev_dat is a fixed-width data file.
fev_dat <- read_fwf(
file="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/examples/fev_dat.txt",
skip = 1,
fwf_widths(c(4, 7, 7, 6, 6)))
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## X2 = col_double(),
## X3 = col_double(),
## X4 = col_integer(),
## X5 = col_integer()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 654 parsing failures.
## row 1, col X5: expected 6 chars, actual 1 (and similarly for the remaining rows)
## See problems(...) for more details.
colnames(fev_dat) <- c('age','FEV','ht','sex','smoke')
fev_dat <- fev_dat %>% filter(age>=6, age<=10)
head(fev_dat)
## # A tibble: 6 x 5
## age FEV ht sex smoke
## <int> <dbl> <dbl> <int> <int>
## 1 9 1.71 57.0 0 0
## 2 8 1.72 67.5 0 0
## 3 7 1.72 54.5 0 0
## 4 9 1.56 53.0 1 0
## 5 9 1.90 57.0 1 0
## 6 8 2.34 61.0 0 0
ggplot(fev_dat, aes(y=FEV, x=age)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("FEV") +
xlab("Age (years)") +
ggtitle("FEV versus Age")
m <- lm(FEV ~ age, data = fev_dat)
summary(m)
##
## Call:
## lm(formula = FEV ~ age, data = fev_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.22576 -0.28855 -0.00534 0.27106 1.90724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01165 0.15237 0.076 0.939
## age 0.26721 0.01801 14.839 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4312 on 349 degrees of freedom
## Multiple R-squared: 0.3869, Adjusted R-squared: 0.3851
## F-statistic: 220.2 on 1 and 349 DF, p-value: < 2.2e-16
An important feature of this example is that the variance of individual y-values from the regression line increases as age increases.
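A residuals-versus-age plot (a sketch reusing the fit above) makes the increasing spread easier to see:
# Plot residuals against age; the vertical spread widens with age.
ggplot(data.frame(age = m$model$age, resid = residuals(m)), aes(y=resid, x=age)) +
geom_point() +
geom_hline(yintercept = 0) +
ylab("Residual") +
xlab("Age (years)") +
ggtitle("Residuals versus Age")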