Regression Part 2

Celia Evans

Introduction

Example Data Set

Diamond

## # A tibble: 48 x 2
##    carat price
##    <dbl> <dbl>
##  1  0.17   355
##  2  0.16   328
##  3  0.17   350
##  4  0.18   325
##  5  0.25   642
##  6  0.16   342
##  7  0.15   322
##  8  0.19   485
##  9  0.21   483
## 10  0.15   323
## # … with 38 more rows

Using lm

fit <- lm(price ~ carat,data = diamond)
m = fit$coef[[2]]
b = fit$coef[[1]]
c(m,b)
## [1] 3721.0249 -259.6259

So how hard is it to calculate m and b?

Empirical Statistics

Some definitions from elementary statistics.

Let \(X = \{x_1,x_2,x_3,\ldots,x_n\}\) be a set consisting of \(n\) numbers and let \(Y = \{y_1,y_2,y_3,\dots,y_n\}\) be another set of \(n\) numbers.

What is traditional Regression?

Next simplest case

Let’s try this on diamond

x = diamond$carat
y = diamond$price

beta = sum(x*y)/sum(x^2)
beta
## [1] 2538.933

This is clearly not satisfactory,

fit = lm(y ~ x)
beta0 = coef(fit)[[1]]
beta1 = coef(fit)[[2]]
c(beta0,beta1)
## [1] -259.6259 3721.0249
as_tibble(x,y) %>% ggplot(aes(x = x, y = y)) +
  geom_point(alpha = 1/3, size = 6, color = "black") +
  geom_point(alpha = 1/3, size = 5, color = "red") +
  geom_abline(aes(slope = beta1, intercept = beta0)) +
  ggtitle("approximation with both intercepts")

So did we waste our time with the simple example?

xc = x - mean(x)
yc = y - mean(y)

fitc <- lm(yc ~ xc)
fitc
## 
## Call:
## lm(formula = yc ~ xc)
## 
## Coefficients:
## (Intercept)           xc  
##   5.145e-14    3.721e+03
beta_test = mean(xc*yc)/mean(xc^2)
beta_test
## [1] 3721.025

Covariance and Correlation

Back to our original formula for \(\beta_1 = \frac{\sum x_i y_i}{sum x_i^2}\)

beta1 = cov(x,y)/sd(x)^2
beta0 = mean(y) - beta1*mean(x)
rbind(c(beta0, beta1),c(coef(fit)[[1]],coef(fit)[[2]]))
##           [,1]     [,2]
## [1,] -259.6259 3721.025
## [2,] -259.6259 3721.025

Other things you can get out of lm

fit <- lm(y ~ x)
summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.159 -21.448  -0.869  18.972  79.370 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -259.63      17.32  -14.99   <2e-16 ***
## x            3721.02      81.79   45.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.84 on 46 degrees of freedom
## Multiple R-squared:  0.9783, Adjusted R-squared:  0.9778 
## F-statistic:  2070 on 1 and 46 DF,  p-value: < 2.2e-16
fit$coefficients
## (Intercept)           x 
##   -259.6259   3721.0249
fit$residuals
##           1           2           3           4           5           6 
## -17.9483176  -7.7380691 -22.9483176 -85.1585661 -28.6303057   6.2619309 
##           7           8           9          10          11          12 
##  23.4721795  37.6311854 -38.7893116  24.4721795  51.8414339  40.7389488 
##          13          14          15          16          17          18 
##   0.2619309  13.4209369  -1.2098087  40.5287002  36.1029250 -44.8405542 
##          19          20          21          22          23          24 
##  79.3696943 -25.0508027  57.8414339   9.2619309 -20.9483176  -3.7380691 
##          25          26          27          28          29          30 
## -19.9483176  27.8414339 -54.9483176   8.8414339 -26.9483176  16.4721795 
##          31          32          33          34          35          36 
## -22.9483176 -13.1020453 -12.1020453  -0.5278205   3.2619309   2.2619309 
##          37          38          39          40          41          42 
##  -1.2098087 -43.2098087 -27.9483176 -23.3122938 -15.6303057  43.2672091 
##          43          44          45          46          47          48 
##  32.8414339   7.3696943   4.3696943 -11.5278205 -14.8405542  17.4721795
predict(fit)
##         1         2         3         4         5         6         7         8 
##  372.9483  335.7381  372.9483  410.1586  670.6303  335.7381  298.5278  447.3688 
##         9        10        11        12        13        14        15        16 
##  521.7893  298.5278  410.1586  782.2611  335.7381  484.5791  596.2098  819.4713 
##        17        18        19        20        21        22        23        24 
##  186.8971  707.8406  670.6303  745.0508  410.1586  335.7381  372.9483  335.7381 
##        25        26        27        28        29        30        31        32 
##  372.9483  410.1586  372.9483  410.1586  372.9483  298.5278  372.9483  931.1020 
##        33        34        35        36        37        38        39        40 
##  931.1020  298.5278  335.7381  335.7381  596.2098  596.2098  372.9483  968.3123 
##        41        42        43        44        45        46        47        48 
##  670.6303 1042.7328  410.1586  670.6303  670.6303  298.5278  707.8406  298.5278
ggplot() +
  geom_point(aes(x = x, y = y)) +
  geom_point(aes(x = x, y = predict(fit)),color = "red", shape = 'o',size = 4) +
  geom_abline(aes(slope = fit$coefficients[[2]], intercept = fit$coefficients[[1]]))