Statistical Modeling

Linear Modeling: used for explaining or modeling the relationship between a variable Y and one or more variables X1, X2, …, Xp
* Y: dependent variable; also called the response, outcome, or output variable
* X1, X2, …, Xp: independent, predictor, input, or explanatory variables
The response variable Y must be continuous
The explanatory variables X1, X2, …, Xp can be continuous, discrete, or categorical
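
For instance, a minimal sketch of a model that mixes a continuous and a categorical predictor, using the cats data from the MASS package (introduced in the example below):

library(MASS)                            # provides the cats data
fit <- lm(Hwt ~ Bwt + Sex, data = cats)  # continuous (Bwt) + categorical (Sex)
coef(fit)                                # Sex enters via a 0/1 dummy variable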

Model types I

Francis Galton Example

For a response y and a single predictor x, we can write the equation: \(\frac{y - \bar y}{SD_y} = r \frac{x - \bar x}{SD_x}\)
r: the sample correlation between x and y
In simpler terms: \(y = \beta_0 + \beta_1 x\), with slope \(\beta_1 = r\frac{SD_y}{SD_x}\) and intercept \(\beta_0 = \bar y - \beta_1 \bar x\)
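
As a numerical check of this relationship (a sketch using the cats data loaded below): regressing standardized y on standardized x yields a slope equal to the sample correlation r.

library(MASS)                      # for the cats data
zx <- as.numeric(scale(cats$Bwt))  # standardize x: (x - xbar) / SD_x
zy <- as.numeric(scale(cats$Hwt))  # standardize y: (y - ybar) / SD_y
coef(lm(zy ~ zx))[2]               # slope of the standardized regression ...
cor(cats$Bwt, cats$Hwt)            # ... equals the sample correlation r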

Simple Linear Regression

Example

library(MASS)
help(cats)
summary(cats)
##  Sex         Bwt             Hwt       
##  F:47   Min.   :2.000   Min.   : 6.30  
##  M:97   1st Qu.:2.300   1st Qu.: 8.95  
##         Median :2.700   Median :10.10  
##         Mean   :2.724   Mean   :10.63  
##         3rd Qu.:3.025   3rd Qu.:12.12  
##         Max.   :3.900   Max.   :20.50
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ggplot(cats, aes(x = Bwt, y = Hwt, color = Sex)) +
  geom_point()

Note

Population

  • Linear Regression Model: \(y_i= \beta_0 + \beta_1x_i+e_i\)
    \(e_i \sim N(0, \sigma^2)\)
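
A short simulation sketch makes the population model concrete (the true values \(\beta_0 = 1\), \(\beta_1 = 2\), \(\sigma = 0.5\) below are arbitrary choices for illustration):

set.seed(1)                        # reproducible example
n <- 100
x <- runif(n, 0, 10)               # arbitrary predictor values
e <- rnorm(n, mean = 0, sd = 0.5)  # e_i ~ N(0, sigma^2), sigma = 0.5
y <- 1 + 2 * x + e                 # y_i = beta0 + beta1 * x_i + e_i
coef(lm(y ~ x))                    # estimates should be close to (1, 2)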

Sample

  • Estimated regression equation: \(\hat y_i= \hat\beta_0+ \hat\beta_1x_i\)
  • Residual: \(r_i=y_i-\hat y_i\)
    • \(y_i\): observation (obs)
    • \(\hat y_i\): prediction (pred)
  • Residual Sum of Squares (RSS): \(\sum{r_i^2}\)
  • Parameter estimators: \(\hat\beta_0, \hat\beta_1, \hat\sigma^2\)
    • random variables
    • vary from sample to sample

Class jargon: fitted value, residual, Residual Sum of Squares (RSS), R-squared (used to assess the overall model fit)

Parameter estimation by Least Squares

  • Least Squares Estimation: find \((\beta_0, \beta_1)\) that minimize the residual sum of squares (RSS)

    • \(RSS = \sum{(y_i-\hat y_i)^2}= \sum{(y_i-(\beta_0+\beta_1x_i))^2}\)
      \(\frac{\partial RSS}{\partial \beta_0}= -2\sum{(y_i-\beta_0-\beta_1x_i)}\)
      \(\frac{\partial RSS}{\partial \beta_1}= -2\sum{x_i(y_i-\beta_0-\beta_1x_i)}\)
      Setting both partial derivatives to zero yields the normal equations below.
  • Rearrange the equations

    • \(\beta_0n+ \beta_1\sum{x_i}= \sum{y_i}\) (1)
    • \(\beta_0\sum{x_i}+ \beta_1\sum{x_i^2}= \sum{x_iy_i}\) (2)
  • From (1), we have \(\bar y = \hat\beta_0+\hat\beta_1\bar x\)

    • \((\bar x, \bar y)\) is always on the estimated regression line
  • Plug (1) back into (2)

    • \(\hat\beta_1= \frac{\sum{x_i(y_i-\bar y)}}{\sum{x_i(x_i-\bar x)}}\)
  • LS estimates of \((\beta_0, \beta_1)\) can be expressed as

    • \[\hat\beta_0= \bar y - \hat\beta_1 \bar x\]
    • \[\hat\beta_1=\frac{S_{xy}}{S_{xx}}=r_{xy}\sqrt{\frac{S_{yy}}{S_{xx}}}\]
      • \(S_{xy}=\sum(x_i-\bar x)(y_i-\bar y)\)
      • \(S_{xx}=\sum(x_i-\bar x)^2\)
      • \(S_{yy}=\sum(y_i-\bar y)^2\)
      • \(r_{xy}=\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}\) (the sample correlation)
  • The final equation of y as a function of x is given by:

    • \[y \approx (\bar y - r_{xy}\sqrt{\frac{S_{yy}}{S_{xx}}}\bar x)+ (r_{xy}\sqrt{\frac{S_{yy}}{S_{xx}}})x\]
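
These closed-form estimates are easy to verify numerically; a minimal sketch using the cats data:

library(MASS)                  # for the cats data
x <- cats$Bwt; y <- cats$Hwt
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
b1 <- Sxy / Sxx                # hat(beta1) = Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)   # hat(beta0) = ybar - hat(beta1) * xbar
c(b0 = b0, b1 = b1)            # matches coef(lm(Hwt ~ Bwt, data = cats))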

Jargon

  • fitted value/prediction at \(x_i\): \(\hat y_i = \hat\beta_0 + \hat\beta_1 x_i\)
  • Residual at \(x_i\): \(r_i= y_i- \hat y_i\)
    • \(\sum{r_i}=0\)
    • \(\sum{x_ir_i}=0\)
  • Residual Sum of Squares(RSS): \(\sum{r_i^2}\)
  • The error variance is estimated as: \[\hat\sigma^2= \frac{1}{n-2}RSS =\frac{1}{n-2}\sum r_i^2 \]
  • residual degrees of freedom (df): n - 2
    • In general, df = sample size - number of parameters
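
These residual properties and the variance estimate can be checked directly; a quick sketch (the same lm() fit is shown in full in the next section, and the last line reproduces the "Residual standard error" reported by summary()):

library(MASS)
fit <- lm(Hwt ~ Bwt, data = cats)
r <- resid(fit)
sum(r)                             # 0 up to floating-point error
sum(cats$Bwt * r)                  # 0 up to floating-point error
sqrt(sum(r^2) / (nrow(cats) - 2))  # hat(sigma) = sqrt(RSS / (n - 2))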

Goodness of fit: R-squared

  • Total Sum of Squares (TSS): the total variation of y; it can be decomposed into the total variation of the fitted values \(\hat y_i\) (FSS) plus the Residual Sum of Squares (RSS)
  • \[TSS = \sum{(y_i-\bar y)^2} = \sum{r_i^2}+ \sum{(\hat y_i-\bar y)^2} = RSS + FSS\]
out = lm(Hwt ~ Bwt, data = cats)
summary(out)
## 
## Call:
## lm(formula = Hwt ~ Bwt, data = cats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5694 -0.9634 -0.0921  1.0426  5.1238 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.3567     0.6923  -0.515    0.607    
## Bwt           4.0341     0.2503  16.119   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.452 on 142 degrees of freedom
## Multiple R-squared:  0.6466, Adjusted R-squared:  0.6441 
## F-statistic: 259.8 on 1 and 142 DF,  p-value: < 2.2e-16
  • \(R^2=0.6466\), i.e., 64.66%
    • interpretation: about 64.66% of the total variation in Hwt (y) can be explained by its linear relationship with Bwt (x)
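
The decomposition can be checked directly from the fitted object out:

y <- cats$Hwt
TSS <- sum((y - mean(y))^2)             # total variation of y
RSS <- sum(resid(out)^2)                # residual sum of squares
FSS <- sum((fitted(out) - mean(y))^2)   # variation of the fitted values
c(TSS, RSS + FSS)                       # the two agree: TSS = RSS + FSS
FSS / TSS                               # equals R^2 = 0.6466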

Extract Information and make some calculations

names(out)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
out$coef
## (Intercept)         Bwt 
##  -0.3566624   4.0340627
attach(cats)
cor(Hwt,Bwt)^2
## [1] 0.6466209

Different ways to find R^2
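
For example, three equivalent computations (a sketch reusing out and the attached cats data from above):

summary(out)$r.squared    # 1. reported directly by summary()
cor(Hwt, Bwt)^2           # 2. squared sample correlation (simple regression only)
1 - sum(resid(out)^2) / sum((Hwt - mean(Hwt))^2)  # 3. 1 - RSS/TSS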