Statistical Modeling

Linear Modeling: used for explaining or modeling the relationship between a variable Y and one or more variables X1, X2, …, Xp
* Y: dependent variable; also called the response, outcome, or output variable
* X1, X2, …, Xp: independent, predictor, input, or explanatory variables
The response variable Y must be continuous
The explanatory variables X1, X2, …, Xp can be continuous, discrete, or categorical
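
For instance, a minimal sketch of a model that mixes a continuous and a categorical predictor, using the cats data from the MASS package (introduced in the example below):

library(MASS)                            # provides the cats data
fit <- lm(Hwt ~ Bwt + Sex, data = cats)  # continuous (Bwt) + categorical (Sex)
coef(fit)                                # Sex enters via a 0/1 dummy variable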

Model types I

Francis Galton Example

For a response y and a single predictor x, we can write the equation: \(\frac{y - \bar y}{SD_y} = r \frac{x - \bar x}{SD_x}\)
r: the sample correlation between x and y
In simpler terms: \(y = \beta_0 + \beta_1 x\), with slope \(\beta_1 = r\frac{SD_y}{SD_x}\) and intercept \(\beta_0 = \bar y - \beta_1 \bar x\)
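
As a numerical check of this relationship (a sketch using the cats data loaded below): regressing standardized y on standardized x yields a slope equal to the sample correlation r.

library(MASS)                      # for the cats data
zx <- as.numeric(scale(cats$Bwt))  # standardize x: (x - xbar) / SD_x
zy <- as.numeric(scale(cats$Hwt))  # standardize y: (y - ybar) / SD_y
coef(lm(zy ~ zx))[2]               # slope of the standardized regression ...
cor(cats$Bwt, cats$Hwt)            # ... equals the sample correlation r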

Simple Linear Regression

Example

library(MASS)
help(cats)
summary(cats)
##  Sex         Bwt             Hwt       
##  F:47   Min.   :2.000   Min.   : 6.30  
##  M:97   1st Qu.:2.300   1st Qu.: 8.95  
##         Median :2.700   Median :10.10  
##         Mean   :2.724   Mean   :10.63  
##         3rd Qu.:3.025   3rd Qu.:12.12  
##         Max.   :3.900   Max.   :20.50
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ggplot(cats, aes(x = Bwt, y = Hwt, color = Sex)) +
  geom_point()

Note

Population

  • Linear Regression Model: \(y_i= \beta_0 + \beta_1x_i+e_i\)
    \(e_i \sim N(0, \sigma^2)\)
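
A short simulation sketch makes the population model concrete (the true values \(\beta_0 = 1\), \(\beta_1 = 2\), \(\sigma = 0.5\) below are arbitrary choices for illustration):

set.seed(1)                        # reproducible example
n <- 100
x <- runif(n, 0, 10)               # arbitrary predictor values
e <- rnorm(n, mean = 0, sd = 0.5)  # e_i ~ N(0, sigma^2), sigma = 0.5
y <- 1 + 2 * x + e                 # y_i = beta0 + beta1 * x_i + e_i
coef(lm(y ~ x))                    # estimates should be close to (1, 2)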

Sample

  • Estimated regression equation: \(\hat y_i= \hat\beta_0+ \hat\beta_1x_i\)
  • Residual: \(r_i=y_i-\hat y_i\)
    • \(y_i\): observation (obs)
    • \(\hat y_i\): prediction (pred)
  • Residual Sum of Squares (RSS): \(\sum{r_i^2}\)
  • Parameter estimators: \(\hat\beta_0, \hat\beta_1, \hat\sigma^2\)
    • random variables
    • vary from sample to sample

Class jargon: fitted value, residual, Residual Sum of Squares (RSS), R-squared (used to assess the overall model fit)

Parameter estimation by Least Squares

  • Least Squares Estimation: find \((\beta_0, \beta_1)\) that minimize the residual sum of squares (RSS)

    • \(RSS = \sum{(y_i-\hat y_i)^2}= \sum{(y_i-(\beta_0+\beta_1x_i))^2}\)
      \(\frac{\partial RSS}{\partial \beta_0}= -2\sum{(y_i-\beta_0-\beta_1x_i)}\)
      \(\frac{\partial RSS}{\partial \beta_1}= -2\sum{x_i(y_i-\beta_0-\beta_1x_i)}\)
      Setting both partial derivatives to zero yields the normal equations below.
  • Rearrange the equations

    • \(\beta_0n+ \beta_1\sum{x_i}= \sum{y_i}\) (1)
    • \(\beta_0\sum{x_i}+ \beta_1\sum{x_i^2}= \sum{x_iy_i}\) (2)
  • From (1), we have \(\bar y = \hat\beta_0+\hat\beta_1\bar x\)

    • \((\bar x, \bar y)\) is always on the estimated regression line
  • Plug (1) back into (2)

    • \(\hat\beta_1= \frac{\sum{x_i(y_i-\bar y)}}{\sum{x_i(x_i-\bar x)}}\)
  • LS estimates of \((\beta_0, \beta_1)\) can be expressed as

    • \[\hat\beta_0= \bar y - \hat\beta_1 \bar x\]
    • \[\hat\beta_1=\frac{S_{xy}}{S_{xx}}=r_{xy}\sqrt{\frac{S_{yy}}{S_{xx}}}\]
      • \(S_{xy}=\sum(x_i-\bar x)(y_i-\bar y)\)
      • \(S_{xx}=\sum(x_i-\bar x)^2\)
      • \(S_{yy}=\sum(y_i-\bar y)^2\)
      • \(r_{xy}=\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}\) (the sample correlation)
  • The final equation of y as a function of x is given by:

    • \[y \approx (\bar y - r_{xy}\sqrt{\frac{S_{yy}}{S_{xx}}}\bar x)+ (r_{xy}\sqrt{\frac{S_{yy}}{S_{xx}}})x\]
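
These closed-form estimates are easy to verify numerically; a minimal sketch using the cats data:

library(MASS)                  # for the cats data
x <- cats$Bwt; y <- cats$Hwt
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
b1 <- Sxy / Sxx                # hat(beta1) = Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)   # hat(beta0) = ybar - hat(beta1) * xbar
c(b0 = b0, b1 = b1)            # matches coef(lm(Hwt ~ Bwt, data = cats))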

Jargon

  • fitted value/prediction at \(x_i\): \(\hat y_i = \hat\beta_0 + \hat\beta_1 x_i\)
  • Residual at \(x_i\): \(r_i= y_i- \hat y_i\)
    • \(\sum{r_i}=0\)
    • \(\sum{x_ir_i}=0\)
  • Residual Sum of Squares(RSS): \(\sum{r_i^2}\)
  • The error variance is estimated as: \[\hat\sigma^2= \frac{1}{n-2}RSS =\frac{1}{n-2}\sum r_i^2 \]
  • residual degrees of freedom (df): n - 2
    • In general, df = sample size - number of parameters
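
These residual properties and the variance estimate can be checked directly; a quick sketch (the same lm() fit is shown in full in the next section, and the last line reproduces the "Residual standard error" reported by summary()):

library(MASS)
fit <- lm(Hwt ~ Bwt, data = cats)
r <- resid(fit)
sum(r)                             # 0 up to floating-point error
sum(cats$Bwt * r)                  # 0 up to floating-point error
sqrt(sum(r^2) / (nrow(cats) - 2))  # hat(sigma) = sqrt(RSS / (n - 2))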

Goodness of fit: R-squared

  • Total Sum of Squares (TSS): the total variation of y; it can be decomposed into the total variation of the fitted values \(\hat y_i\) (FSS) plus the Residual Sum of Squares (RSS)
  • \[TSS = \sum{(y_i-\bar y)^2} = \sum{r_i^2}+ \sum{(\hat y_i-\bar y)^2} = RSS + FSS\]
out = lm(Hwt ~ Bwt, data = cats)
summary(out)
## 
## Call:
## lm(formula = Hwt ~ Bwt, data = cats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5694 -0.9634 -0.0921  1.0426  5.1238 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.3567     0.6923  -0.515    0.607    
## Bwt           4.0341     0.2503  16.119   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.452 on 142 degrees of freedom
## Multiple R-squared:  0.6466, Adjusted R-squared:  0.6441 
## F-statistic: 259.8 on 1 and 142 DF,  p-value: < 2.2e-16
  • \(R^2=0.6466\), i.e., 64.66%
    • interpretation: about 64.66% of the total variation in Hwt (y) can be explained by its linear relationship with Bwt (x)
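
The decomposition can be checked directly from the fitted object out:

y <- cats$Hwt
TSS <- sum((y - mean(y))^2)             # total variation of y
RSS <- sum(resid(out)^2)                # residual sum of squares
FSS <- sum((fitted(out) - mean(y))^2)   # variation of the fitted values
c(TSS, RSS + FSS)                       # the two agree: TSS = RSS + FSS
FSS / TSS                               # equals R^2 = 0.6466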

Extract Information and make some calculations

names(out)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
out$coef
## (Intercept)         Bwt 
##  -0.3566624   4.0340627
attach(cats)
cor(Hwt,Bwt)^2
## [1] 0.6466209

Different ways to find R^2
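
For example, three equivalent computations (a sketch reusing out and the attached cats data from above):

summary(out)$r.squared    # 1. reported directly by summary()
cor(Hwt, Bwt)^2           # 2. squared sample correlation (simple regression only)
1 - sum(resid(out)^2) / sum((Hwt - mean(Hwt))^2)  # 3. 1 - RSS/TSS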