Unmixed Models (Regression Review)

Phillip M. Alday
12 August 2016 (Melbourne)

What does linear regression do?

Determines the line of best fit

“Fits” a line

What do we mean with best fit?

The line such that minimizes the summed distances from the line to the data points.

But "distance" is ambiguous in two ways ....

Metric used
Direction of measurement

Metric used

Where do all the sums of squares come from?

Distance must be positive!

Subtraction not commutative: \( a - b \neq b - a \)
But is antisymmetric: \( a - b = -(b-a) \)
Get rid of the minus signs to get commutativity
Obvious way: absolute value
Another way: squaring

A unifying perspective on non-negativity

\( d_2(a,b) = |a - b|^2 \)
\( d_1(a,b) = |a - b|^1 \)
\( d_0(a,b) = |a - b|^0 \) if we let \( 0^0 = 0 \)

We can do this for arbitrary powers – these are the so-called \( L_p \) norms

(Fine print: \( 0^0 \) is in general undefined because it is equal to \( \frac{0}{0} \) and division by zero is undefined, but this definition is internally consistent here.)

What does distance look like for these?

plot of chunk unnamed-chunk-1

What does distance look like for these?

plot of chunk unnamed-chunk-2

Measures of Central Tendency

A measure of central tendency yields the point(s) closest to all the most points, i.e. minimizes error (and is therefore in some sense the “expected value”)
- Minimize \( L_0 \) – only distinctness matters: Mode
- Minimize \( L_1 \) – only sequence matters: Median
- Minimize \( L_2 \) – penalize heavily for extremes: Mean
Along with the variance, this is a very convenient way to summarise many types of data, especially normal (Gaussian) and “normal-like” data (e.g. \( t \)-distribution)
But be careful, both measures can be misleading (see e.g. Rand Wilcox's examples on mixtures of normals in psychology)

Inspired by some comments from John Myles White

Sum of Squares

\( L_2 \) and hence the mean very sensitive to outliers
both a good thing and a bad thing
- minimizes the number of points very far away
- at the cost of moving some near points further away
square distance also related to the Euclidean distance, i.e. the distance we use in everyday life
median common in robust statistics
- not as good for “clean” data, but higher breakdown point (see Wilcox)
- deep questions about what “outlier” actually means for the interpreation
The sample mean is BLUE

Direction of measurement

An example

plot of chunk unnamed-chunk-3

Minimizing vertical distance

plot of chunk unnamed-chunk-4

Minimizing horizontal distance

plot of chunk unnamed-chunk-5

Not the same!

plot of chunk unnamed-chunk-6

Why do we minimize vertical distance?

Minimizing vertical distance has important implications!

The assumption is that we can manipulate / measure the independent variables / predictors exactly
Often in psychology we can't
Or even worse, we measure proxies
This has profound impacts, see Westfall & Yarkoni 2016
SEM and errors-in-variables models address parts of this problem
(a talk for another time)

What would minimizing "diagonal" distance look like?

Models and estimation

The general linear model

\[ Y = \beta_{1}X_1 + \beta_{0} + \varepsilon \]

Independent variable(s) measured without error
Dependent variable a linear function of independent variable plus some error / random variation
Error normally distributed
Assumptions about the distribution of the error very important!

Likelihood

\( p(H|D) \): posterior probability
\( p(D|H) \): likelihood

Related via Bayes' Theorem: \[ p(H|D) = \frac{p(D|H)p(H)}{p(D)} \]

Likelihood vs. Posterior

You need to remember that “likelihood” is a technical term. The likelihood of \( H \), \( Pr(O\|H) \), and the posterior probability of \( H \), \( Pr(H\|O) \), are different quantities and they can have different values. The likelihood of \( H \) is the probability that \( H \) confers on \( O \), not the probability that \( O \) confers on \( H \). Suppose you hear a noise coming from the attic of your house. You consider the hypothesis that there are gremlins up there bowling. The likelihood of this hypothesis is very high, since if there are gremlins bowling in the attic, there probably will be noise. But surely you don’t think that the noise makes it very probable that there are gremlins up there bowling. In this example, \( Pr(O\|H) \) is high and \( Pr(H\|O) \) is low. The gremlin hypothesis has a high likelihood (in the technical sense) but a low probability.

Sober, E. (2008). Evidence and Evolution: the Logic Behind the Science. Cambridge University Press.

Maximum Likelihood Estimation

Find the coefficients / predictor that maximizes the likelihood
i.e. find the quantitative hypothesis which has the best chance of generating the observed data
many numerical, iterative techniques for doing this (implemented in R with glm())

Ordinary Least Squares

special case with closed-form solution for linear regression using the mean and vertical distance (implemented in R with lm())
BLUE
these two properties: computational simplicity and “ideal” estimator under certain circumstances led to widespread use before the advent of modern computers

MLE: General method

summary(glm(y~x))


Call:
glm(formula = y ~ x)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-181.168   -39.710     4.854    39.162   133.582  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  46.1361    12.6521   3.647 0.000428 ***
x             2.6567     0.2175  12.214  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 3942.172)

    Null deviance: 974458  on 99  degrees of freedom
Residual deviance: 386333  on 98  degrees of freedom
AIC: 1115.7

Number of Fisher Scoring iterations: 2

MLE: OLS

summary(lm(y~x))


Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-181.168  -39.710    4.854   39.162  133.582 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  46.1361    12.6521   3.647 0.000428 ***
x             2.6567     0.2175  12.214  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 62.79 on 98 degrees of freedom
Multiple R-squared:  0.6035,    Adjusted R-squared:  0.5995 
F-statistic: 149.2 on 1 and 98 DF,  p-value: < 2.2e-16

Some of the assumptions of linear regression

independence of errors
equal variance of errors (homogeneity of variance, i.e. homoskedasticity)
normality of errors

Categorical variables, t-tests and ANOVA

xyplot(weight ~ feed,data=chickwts,main="Effect of diet on chick growth",ylab="weight",xlab="feed")

plot of chunk unnamed-chunk-9

Pairwise comparison

pair <- subset(chickwts,feed %in% c("casein","horsebean"),drop=TRUE)
xyplot(weight ~ feed,data=pair,main="Effect of diet on chick growth",ylab="weight",xlab="feed")

plot of chunk unnamed-chunk-10

Two-sample t-test and ANOVA

t.test(weight ~ feed,data=pair,var.equal=TRUE,paired=FALSE)


    Two Sample t-test

data:  weight by feed
t = 7.0197, df = 20, p-value = 8.255e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 114.8329 211.9338
sample estimates:
   mean in group casein mean in group horsebean 
               323.5833                160.2000

summary(aov(weight ~ feed,data=pair))

            Df Sum Sq Mean Sq F value   Pr(>F)    
feed         1 145604  145604   49.28 8.25e-07 ***
Residuals   20  59097    2955                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A regression perspective

Does the slope of this line differ from zero?

xyplot(weight ~ feed,data=pair,main="Effect of diet on chick growth",ylab="weight",xlab="feed",type=c("p","r"))

plot of chunk unnamed-chunk-13

A regression perspective

summary(lm(weight ~ feed,data=pair))


Call:
lm(formula = weight ~ feed, data = pair)

Residuals:
    Min      1Q  Median      3Q     Max 
-107.58  -33.20    3.80   42.17   80.42 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     323.58      15.69   20.62 6.02e-15 ***
feedhorsebean  -163.38      23.27   -7.02 8.25e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.36 on 20 degrees of freedom
Multiple R-squared:  0.7113,    Adjusted R-squared:  0.6969 
F-statistic: 49.28 on 1 and 20 DF,  p-value: 8.255e-07

anova(lm(weight ~ feed,data=pair))

Analysis of Variance Table

Response: weight
          Df Sum Sq Mean Sq F value    Pr(>F)    
feed       1 145604  145604  49.277 8.255e-07 ***
Residuals 20  59097    2955                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A regression perspective

summary(lm(weight ~ feed,data=chickwts))


Call:
lm(formula = weight ~ feed, data = chickwts)

Residuals:
     Min       1Q   Median       3Q      Max 
-123.909  -34.413    1.571   38.170  103.091 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    323.583     15.834  20.436  < 2e-16 ***
feedhorsebean -163.383     23.485  -6.957 2.07e-09 ***
feedlinseed   -104.833     22.393  -4.682 1.49e-05 ***
feedmeatmeal   -46.674     22.896  -2.039 0.045567 *  
feedsoybean    -77.155     21.578  -3.576 0.000665 ***
feedsunflower    5.333     22.393   0.238 0.812495    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.85 on 65 degrees of freedom
Multiple R-squared:  0.5417,    Adjusted R-squared:  0.5064 
F-statistic: 15.36 on 5 and 65 DF,  p-value: 5.936e-10

A regression perspective

anova(lm(weight ~ feed,data=chickwts))

Analysis of Variance Table

Response: weight
          Df Sum Sq Mean Sq F value    Pr(>F)    
feed       5 231129   46226  15.365 5.936e-10 ***
Residuals 65 195556    3009                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the (usual) null hypothesis for ANOVA with more than two categories?

That all contrasts are simultanesously zero

Contrasts matter!

library(car)
options(contrasts=c("contr.Sum","contr.poly"))
summary(lm(weight ~ feed,data=chickwts))


Call:
lm(formula = weight ~ feed, data = chickwts)

Residuals:
     Min       1Q   Median       3Q      Max 
-123.909  -34.413    1.571   38.170  103.091 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        259.131      6.543  39.602  < 2e-16 ***
feed[S.casein]      64.452     14.490   4.448 3.47e-05 ***
feed[S.horsebean]  -98.931     15.601  -6.341 2.48e-08 ***
feed[S.linseed]    -40.381     14.490  -2.787  0.00697 ** 
feed[S.meatmeal]    17.778     15.005   1.185  0.24042    
feed[S.soybean]    -12.703     13.641  -0.931  0.35519    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.85 on 65 degrees of freedom
Multiple R-squared:  0.5417,    Adjusted R-squared:  0.5064 
F-statistic: 15.36 on 5 and 65 DF,  p-value: 5.936e-10