Linear Regression

Introduction

The equation \[y = m x + b\] describes a straight line with slope \(m\) and \(y\)-intercept \(b\).

library(tidyverse) # provides tibble(), ggplot() and %>%
x = rnorm(100)
y = 1.5 * x + 3 # no noise
dat = tibble(x, y)
ggplot(data = dat, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE, color = 'grey') +
  ggtitle("Notice the perfect fit to the data; note also that the x variable is random") +
  theme_bw()

## `geom_smooth()` using formula 'y ~ x'

cor(x,y)

## [1] 1
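
Because y is an exact linear function of x here, a least-squares fit should recover the slope and intercept exactly, up to floating-point rounding. The following is a minimal sketch of that check; the seed and the name fit0 are introduced only for this illustration.

```r
set.seed(1)                 # arbitrary seed, chosen only for reproducibility
x <- rnorm(100)
y <- 1.5 * x + 3            # no noise, as above

fit0 <- lm(y ~ x)           # fit0 is a name introduced for this sketch
coef(fit0)                  # intercept and slope

# Compare with the true values, allowing for floating-point rounding
all.equal(unname(coef(fit0)), c(3, 1.5))
```

A correlation of exactly 1 and a perfectly recovered line are two views of the same fact: there is no scatter around the line.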

Adding Noise

Unfortunately, real-world data are never free of error and uncertainty.

Let’s look at a more sophisticated model.

\[ \begin{aligned} Y &= m X + b + \mathcal{N}(0,\sigma^2) \\ \hat{Y} &= \hat{m} X + \hat{b} \end{aligned} \] where \(\mathcal{N}(0,\sigma^2)\) denotes an error term that is normally distributed with mean \(\mu = 0\) and variance \(\sigma^2\). The “hatted” quantities are the estimates generated by fitting the model.

x = rnorm(100)
sigma = 2
y = 1.5 * x + 3 + rnorm(n=100,sd=sigma)
dat=tibble(x,y)
ggplot(data=dat) +
  geom_point(aes(x = x, y = y)) +
  theme_bw()

What is going on here?

x = rnorm(100)
y = 1.5 * x + 3 + rnorm(100,sd=.5)
dat=tibble(x,y)
ggplot(data=dat) +
  geom_point(aes(x = x, y = y)) +
  geom_smooth(aes(x=x,y=y),method = 'lm') +
  theme_bw()

## `geom_smooth()` using formula 'y ~ x'

fit <- lm(y~x ,data = dat)
summary(fit)

## 
## Call:
## lm(formula = y ~ x, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1313 -0.3681 -0.0172  0.3481  1.1625 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.98110    0.04894   60.91   <2e-16 ***
## x            1.44249    0.04570   31.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4893 on 98 degrees of freedom
## Multiple R-squared:  0.9104, Adjusted R-squared:  0.9095 
## F-statistic: 996.2 on 1 and 98 DF,  p-value: < 2.2e-16
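
The Estimate column in a summary like this comes from ordinary least squares, which is what lm() uses. For a single predictor the estimates have a closed form: the slope is cov(x, y) / var(x), and the intercept is mean(y) minus the slope times mean(x). The sketch below checks this against lm() on freshly simulated data; the seed and the names m_hat, b_hat and fit_check are assumptions made for this illustration, so the numbers will differ slightly from the output above.

```r
set.seed(42)                        # arbitrary seed for reproducibility
x <- rnorm(100)
y <- 1.5 * x + 3 + rnorm(100, sd = 0.5)

m_hat <- cov(x, y) / var(x)         # least-squares slope
b_hat <- mean(y) - m_hat * mean(x)  # least-squares intercept

fit_check <- lm(y ~ x)              # name introduced for this sketch
rbind(by_hand = c(b_hat, m_hat), lm = unname(coef(fit_check)))

sigma(fit_check)                    # residual standard error, an estimate of sigma
```

In the summary above, the estimates 2.98 and 1.44 are close to the true values 3 and 1.5, and the residual standard error of 0.4893 is close to the noise standard deviation of 0.5 used to generate the data.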

rz = residuals(fit) # residuals: observed y minus fitted y
pd = fitted(fit)    # fitted (predicted) values on the regression line
data2 = tibble(x, y, rz, pd)
data2

## # A tibble: 100 × 4
##          x     y     rz    pd
##      <dbl> <dbl>  <dbl> <dbl>
##  1  1.05   5.45   0.955 4.50 
##  2  0.954  4.57   0.213 4.36 
##  3 -0.0921 3.06   0.216 2.85 
##  4 -1.62   0.996  0.350 0.646
##  5  1.31   4.56  -0.311 4.87 
##  6  0.0413 3.34   0.300 3.04 
##  7  3.17   7.10  -0.451 7.56 
##  8 -0.398  2.67   0.259 2.41 
##  9 -0.216  2.38  -0.285 2.67 
## 10  2.12   5.80  -0.245 6.04 
## # … with 90 more rows
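
Each row of a table like this satisfies y = fitted value + residual, and when the model includes an intercept the residuals sum to zero up to rounding. A minimal sketch of both checks on freshly simulated data follows; the seed and the name fit_r are introduced only for this illustration.

```r
set.seed(7)                      # arbitrary seed for reproducibility
x <- rnorm(100)
y <- 1.5 * x + 3 + rnorm(100, sd = 0.5)
fit_r <- lm(y ~ x)               # name introduced for this sketch

rz <- residuals(fit_r)
pd <- fitted(fit_r)

all.equal(unname(pd + rz), y)    # fitted + residual reproduces y
sum(rz)                          # essentially zero when the model has an intercept
```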

ggplot(data = data2,aes(x = x, y = y)) +
  geom_smooth(method = 'lm', se = FALSE, color='lightgrey') +
  geom_segment(aes(xend = x, yend = pd),alpha = .2) +
  geom_point() +
  geom_point(aes(y = pd),shape=1) +
  theme_bw()

## `geom_smooth()` using formula 'y ~ x'

Let’s look at some data with which you are already familiar.

mpg %>% ggplot(aes(x = displ, y = cty)) +
          geom_point() +
          geom_smooth(method = 'lm',se = FALSE,color = 'pink') +
          theme_bw()

## `geom_smooth()` using formula 'y ~ x'

d <- mpg %>% select(displ,cty)
fit = lm(cty ~ displ,data = d)
pred = predict(fit)
ggplot(d, aes(x = displ, y = cty)) +
  geom_smooth(method = 'lm', se = FALSE, color = 'pink') +
  geom_segment(aes(xend = displ, yend = pred), alpha = .2) +
  geom_point() +
  geom_point(aes(y = pred),shape = 1) +
  theme_bw()

## `geom_smooth()` using formula 'y ~ x'

summary(fit)

## 
## Call:
## lm(formula = cty ~ displ, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3109 -1.4695 -0.2566  1.1087 14.0064 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.9915     0.4821   53.91   <2e-16 ***
## displ        -2.6305     0.1302  -20.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.567 on 232 degrees of freedom
## Multiple R-squared:  0.6376, Adjusted R-squared:  0.6361 
## F-statistic: 408.2 on 1 and 232 DF,  p-value: < 2.2e-16
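
The fitted coefficients say that each additional litre of displacement is associated with roughly 2.63 fewer city miles per gallon. One way to use such a fit is to predict cty for a displacement of interest. The sketch below refits the same model under a new name and predicts for a hypothetical 3-litre engine; fit_mpg and new_car are names introduced for this illustration.

```r
library(ggplot2)                        # provides the mpg data set

fit_mpg <- lm(cty ~ displ, data = mpg)  # same model as above, under a new name
new_car <- data.frame(displ = 3)        # hypothetical 3-litre engine
predict(fit_mpg, newdata = new_car)     # about 18.1, i.e. 25.9915 - 2.6305 * 3

# With a single predictor, R-squared equals the squared correlation
all.equal(summary(fit_mpg)$r.squared, cor(mpg$displ, mpg$cty)^2)
```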

# install.packages("UsingR") # run this just once

library(UsingR)

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## Loading required package: HistData

## Loading required package: Hmisc

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## 
## Attaching package: 'UsingR'

## The following object is masked from 'package:survival':
## 
##     cancer
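
The messages above report that MASS, loaded as a dependency of UsingR, masks select() from dplyr. Code such as mpg %>% select(displ, cty) that is run after this point may therefore call the MASS version and fail. A minimal sketch of one common workaround is to name the package explicitly; loading MASS directly here is an assumption made only to reproduce the masking without UsingR.

```r
library(dplyr)
library(ggplot2)   # for the mpg data set
library(MASS)      # attached after dplyr, so MASS::select masks dplyr::select

# The namespace prefix picks dplyr's select() regardless of masking
d <- mpg %>% dplyr::select(displ, cty)
names(d)
```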

head(galton) #head just lists the first few rows of the data set

##   child parent
## 1  61.7   70.5
## 2  61.7   68.5
## 3  61.7   65.5
## 4  61.7   64.5
## 5  61.7   64.0
## 6  62.2   67.5

Assignment

Do your work in a separate RMarkdown document and post your results to RPubs. Send me the link by midnight on Wednesday, August 10.

This is a famous data set collected by a statistician named Francis Galton. He was one of the first people to use linear regression, and it was he who coined the term “regression to the mean” while analyzing this data set.

The data are measurements of the heights, in inches, of parents and their children. The heights are normalized to take gender into account.

The exercise is to determine whether this data set gives us a predictive model for the height of a child whose parent’s height is already known.

Please use this data set and the material in this presentation to answer the following questions:

1. What is the mathematical expression of such a model?
2. Using Galton’s data, fit your model and print the summary of the fit.
3. Using the summary, decide whether your model is useful.
4. If you know your parents’ heights, compute your parent height by:

\[ \text{parent} = \frac{\text{height of father in inches} + 1.08 \times \text{height of mother in inches}}{2}\]

5. What is the difference between your height and your predicted height?
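
The parent-height formula above can be evaluated directly in R. The heights in this sketch are hypothetical values chosen only to illustrate the arithmetic; substitute your own.

```r
# Hypothetical heights in inches, for illustration only
father <- 70
mother <- 64

parent <- (father + 1.08 * mother) / 2
parent   # 69.56
```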

Celia Evans, PhD

2022-08-04
