October 25, 2016

Beta coefficients

What we saw before:

library(dplyr)   # for %>% and mutate_each()

normalize <- function(vect){
  vbar <- mean(vect)
  z <- (vect - vbar)/sd(vect)   # z-score: subtract the mean, divide by the standard deviation
  return(z)
}
mtcars.n <- mtcars %>% mutate_each(funs(normalize))   # apply normalize() to every column

Beta coefficients

An alternative:

mtcars.n2 <- mtcars %>% mutate_each(funs(scale))
mtcars.n3 <- scale(mtcars)[1:32,]

Note: scale(mtcars) returns the scaled data with the mean and standard deviation of each variable attached as attributes (they print after the data). The [1:32,] indexing tells R to keep only the data itself (because mtcars has 32 rows of data).
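
For example, a quick sketch (run on mtcars) showing where those values live; the attribute names below are standard R behaviour for scale():

scaled <- scale(mtcars)            # matrix of z-scores, with centering/scaling info attached
attr(scaled, "scaled:center")      # the column means used for centering
attr(scaled, "scaled:scale")       # the column standard deviations
mtcars.n3 <- scaled[1:32, ]        # subsetting keeps only the data and drops those attributes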

Beta coefficients

Even better, we can use it in our regression (as long as we don't have too many variables).

mod <- lm(scale(mpg) ~ scale(hp) + scale(wt) + scale(disp), mtcars)

Quick recap: logarithmic transformations

  1. Coefficients are interpreted as percent changes instead of unit changes (see the sketch after this list).
  2. Useful for (strictly positive) dollar or population variables, where a normal distribution is unlikely to approximate the data.
    • For values greater than 1, taking logs reduces variability.
    • But for fractions between 0 and 1 it magnifies variability, because the log becomes a very large negative number.
  3. Transforming your left-hand-side (LHS) variable is often a good way to get closer to the CLRM assumptions.
  4. log(x) only works for \(x > 0\). log(x+1) can ameliorate this problem, but makes interpretation more difficult.
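
As a minimal illustration (using mtcars rather than any dollar or population data), a log-log specification whose coefficients read as percent changes rather than unit changes:

# mpg, hp, and wt are all strictly positive, so logs are safe here
fit.log <- lm(log(mpg) ~ log(hp) + log(wt), mtcars)
summary(fit.log)   # coefficients: the % change in mpg associated with a 1% change in hp (or wt)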

Quick recap: quadratic terms

  1. A quadratic term allows for increasing or decreasing returns
  2. The turning point (min or max value of \(\hat{y}\), holding constant other variables) can and should be calculated (one of many "smell tests"; see the sketch after this list)
    • If both terms (\(x\) and \(x^2\)) have the same sign, that point will be at a negative value of \(x\).
  3. The quadratic coefficient is probably small, but its effect might become very big for sufficiently large \(x\) values.
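
A minimal sketch of that smell test, using mtcars for illustration: fit a quadratic and compute the turning point as \(-\hat{\beta}_1/(2\hat{\beta}_2)\).

fit.quad <- lm(mpg ~ hp + I(hp^2), mtcars)
b <- coef(fit.quad)
-b["hp"] / (2 * b["I(hp^2)"])   # the hp value at which predicted mpg turns around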

An overkill look at lm()

Example: Different ways to get the same thing.

  • I want an OLS model, so I have to use the function lm().
  • lm takes different arguments and expects them in a particular order (which you can see by looking at the help file by running ?lm).
  • The first argument it expects is formula.
  • The formula takes the form of LHS ~ RHS.
    • We use the ~ because = already means too many things in R.
    • We can often transform variables inside of our formula
    • If we want to exclude the intercept term (which we might want to do when using categorical data), subtract 1 from the formula
      • qsec ~ as.character(cyl) + hp - 1

Example: Different ways to get the same thing.

  • The second argument is data. Our data is stored in an object with a name (in this case mtcars)
  • If we put the arguments in the default order, we don't have to tell lm() which is which.
  • If we put things out of order (or if we're not sure), we should explicitly tell it "I'm estimating a linear model where the formula is y ~ x1 + x2 and the data is in the thing called whatever.my.data.is.named"

Example: Different ways to get the same thing.

# three equivalent ways to fit the same model
fit1a <- lm(qsec ~ as.character(cyl) + hp - 1, mtcars)                   # arguments in the default order
fit1b <- lm(data = mtcars, formula = qsec ~ as.character(cyl) + hp - 1)  # named arguments, any order
fit1c <- mtcars %>% lm(formula = qsec ~ as.character(cyl) + hp - 1)      # pipe the data in
fit1a %>% summary()
summary(fit1b)

Interaction terms!

  • Sometimes the effect of one variable depends on the value of another.
    • e.g. the effects of practice might depend on having sufficient background skills/aptitudes. LeBron's practice time will have greater returns than my practice time.
load("hprice3.RData")
fit0 <- lm(scale(price) ~ scale(cbd) + scale(area), data)
fit1 <- lm(scale(price) ~ scale(cbd)*scale(area), data)

Interaction terms

Call: lm(formula = scale(price) ~ scale(cbd) + scale(area), data = data)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.346e-17  4.232e-02   0.000  1.00000    
scale(cbd)   1.120e-01  4.304e-02   2.602  0.00969 ** 
scale(area)  6.259e-01  4.304e-02  14.542  < 1e-04 ***

Residual standard error: 0.7583 on 318 degrees of freedom
Multiple R-squared:  0.4286,    Adjusted R-squared:  0.425 
F-statistic: 119.3 on 2 and 318 DF,  p-value: < 1e-04

Interaction terms

Call: lm(formula = scale(price) ~ scale(cbd) * scale(area), data = data)
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -0.01126    0.04313  -0.261  0.79425    
scale(cbd)              0.12849    0.04478   2.869  0.00439 ** 
scale(area)             0.63728    0.04385  14.532  < 1e-04 ***
scale(cbd):scale(area)  0.06509    0.04948   1.316  0.18929    

Residual standard error: 0.7574 on 317 degrees of freedom
Multiple R-squared:  0.4317,    Adjusted R-squared:  0.4264 
F-statistic: 80.28 on 3 and 317 DF,  p-value: < 1e-04

The positive scale(cbd):scale(area) term (the "interaction effect") implies that larger houses are worth even more when they're closer to the central business district. (Mind you, the effect is small and statistically insignificant.)

Interpreting interaction terms

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + u\]

What is the effect (on \(y\)) of increasing \(x_1\) by one unit?

Interpreting interaction terms

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + u\]

What is the effect (on \(y\)) of increasing \(x_1\) by one unit?

\(\beta_1 + \beta_3 x_2\).

In effect, \(\beta_1\) is the effect of \(x_1\) when \(x_2=0\).
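
To see this with fit1 from the earlier slides (a sketch, reusing the coefficient names exactly as lm() stores them), the implied effect of scale(cbd) at area one standard deviation below the mean, at the mean, and one above:

b <- coef(fit1)
b["scale(cbd)"] + b["scale(cbd):scale(area)"] * c(-1, 0, 1)   # beta1 + beta3 * x2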

Interpreting interaction terms

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + u\]

By centering (demeaning) the data, we can interpret \(\beta_1\) as the effect of \(x_1\) when \(x_2 = \bar{x_2}\). If you want regular coefficients (i.e. not beta coefficients) you can use the scale() function with the option scale set to FALSE (all caps, or 'F' for short).

fit2 <- lm(scale(price, scale = F) ~ scale(cbd, scale = F)*scale(area, scale = F), data)
hprice <- data        # keep the loaded data under a clearer name (attend.RData will overwrite data below)
hprice.desc <- desc   # the descriptions loaded alongside the data

Another example

load("attend.RData")
attend <- data
attend <- data %>% mutate(ACT2 = ACT^2,
                        priGPA2 = priGPA^2)
fit3 <- lm(stndfnl ~ atndrte*priGPA + ACT + ACT2 + priGPA2, attend)

Example 6.3

Call: lm(formula = stndfnl ~ atndrte * priGPA + ACT + ACT2 + priGPA2, 
    data = attend)
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     2.050293   1.360319   1.507 0.132225    
atndrte        -0.006713   0.010232  -0.656 0.512005    
priGPA         -1.628540   0.481003  -3.386 0.000751 ***
ACT            -0.128039   0.098492  -1.300 0.194047    
ACT2            0.004533   0.002176   2.083 0.037634 *  
priGPA2         0.295905   0.101049   2.928 0.003523 ** 
atndrte:priGPA  0.005586   0.004317   1.294 0.196173    

Residual standard error: 0.8729 on 673 degrees of freedom
Multiple R-squared:  0.2287,    Adjusted R-squared:  0.2218 
F-statistic: 33.25 on 6 and 673 DF,  p-value: < 1e-04

Example 6.3

\(\beta_{atndrte} = -0.006713\), which taken on its own implies that lower attendance rates are good for final exam scores. But \(\beta_{atndrte,priGPA} = 0.005586\), which tells us that for a prior GPA above about 1.2 the overall effect of attendance is positive.
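
That 1.2 threshold comes from setting \(\beta_{atndrte} + \beta_{atndrte,priGPA} \cdot priGPA = 0\); a quick sketch with fit3:

b <- coef(fit3)
-b["atndrte"] / b["atndrte:priGPA"]   # prior GPA above which attendance helps; roughly 1.2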

Example 6.3: t-tests

Both attendance coefficients (on atndrte and atndrte:priGPA) are statistically insignificant, but that's probably just due to multicollinearity (since both terms include atndrte).

To get a better idea of what's going on, we should use an F test. Here we're restricting \(\beta_{atndrte}\) and \(\beta_{atndrte,priGPA}\) to both equal 0 (which we do by leaving them out of the restricted model).

Example 6.3: F test

fit3.r <- lm(stndfnl ~ priGPA + ACT + ACT2 + priGPA2, attend)   # restricted model: both attendance terms dropped
print(anova(fit3,fit3.r))
Analysis of Variance Table

Model 1: stndfnl ~ atndrte * priGPA + ACT + ACT2 + priGPA2
Model 2: stndfnl ~ priGPA + ACT + ACT2 + priGPA2
  Res.Df    RSS Df Sum of Sq     F  Pr(>F)  
1    673 512.76                             
2    675 519.34 -2   -6.5814 4.319 0.01368 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Jointly, these variables are statistically significant with a p-value of 0.014!

Even though individually they're statistically insignificant, we shouldn't be so ready to leave them out of the regression!

Example 6.3: demeaned data

data.dm <- attend %>% mutate_each(funs(scale(., scale = FALSE)))   # demean every column (center only, don't rescale)
fit3.beta <- lm(stndfnl ~ atndrte*priGPA + ACT + ACT2 + priGPA2, data.dm)
summary(fit3.beta)
Call: lm(formula = stndfnl ~ atndrte * priGPA + ACT + ACT2 + priGPA2, 
    data = data.dm)
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)    -0.022125   0.037588  -0.589  0.55631   
atndrte         0.007737   0.002633   2.938  0.00341 **
priGPA         -1.172118   0.529975  -2.212  0.02733 * 
ACT            -0.128039   0.098492  -1.300  0.19405   
ACT2            0.004533   0.002176   2.083  0.03763 * 
priGPA2         0.295905   0.101049   2.928  0.00352 **
atndrte:priGPA  0.005586   0.004317   1.294  0.19617   

Residual standard error: 0.8729 on 673 degrees of freedom
Multiple R-squared:  0.2287,    Adjusted R-squared:  0.2218 
F-statistic: 33.25 on 6 and 673 DF,  p-value: < 1e-04

Example 6.3: demeaned data

When we demean the data, our results change! But only the intercept and the coefficients on the two interacted variables (atndrte and priGPA) move; the interaction coefficient itself is unchanged. And the change makes interpretation easier: the atndrte coefficient is now the effect of attendance at the mean prior GPA. For comparison, here is the original (non-demeaned) output again:

Call: lm(formula = stndfnl ~ atndrte * priGPA + ACT + ACT2 + priGPA2, 
    data = attend)
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     2.050293   1.360319   1.507 0.132225    
atndrte        -0.006713   0.010232  -0.656 0.512005    
priGPA         -1.628540   0.481003  -3.386 0.000751 ***
ACT            -0.128039   0.098492  -1.300 0.194047    
ACT2            0.004533   0.002176   2.083 0.037634 *  
priGPA2         0.295905   0.101049   2.928 0.003523 ** 
atndrte:priGPA  0.005586   0.004317   1.294 0.196173    

Residual standard error: 0.8729 on 673 degrees of freedom
Multiple R-squared:  0.2287,    Adjusted R-squared:  0.2218 
F-statistic: 33.25 on 6 and 673 DF,  p-value: < 1e-04

Adjusted \(R^2\)

  • Increasing \(R^2\) is easy… just over-fit. Add any variable you can find and throw it into your regression.

  • This means \(R^2\) is only a good measure of goodness of fit when we're dealing with honest and alert researchers (who here didn't do their last homework assignment in the middle of the night?).

Recall:

\[R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\]

Adjusted \(R^2\)

\[R^2 = 1 - \frac{SSR/n}{SST/n}\]

  • Dividing numerator/denominator by \(n\) doesn't change anything, but makes it look a bit more like the population variance.

  • Remember, \(\widehat{\text{var}}_x = \frac{\text{total variation}}{n-1}\), but for the population variance you divide by \(n\).
    • For sample variance we divide by \(n-1\) so that we don't bias our estimate of the variance.

Adjusted \(R^2\)

  • For sample \(R^2\) we want to do something similar to estimates of sample variance.

  • Since we're estimating \(k + 1\) coefficients (each \(x\) variable and the intercept term) we should adjust our sum of squared residuals (SSR) accordingly.

  • For total variation (SST) we'll follow the example of sample variance and adjust by 1 (since SST is relative to an estimated sample mean… i.e. there's one parameter being estimated).

Adjusted \(R^2\)

\[\bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)}\]

  • Strictly speaking, this doesn't get us closer to knowing anything about the true/underlying/population variation.

  • What's nice about \(\bar{R}^2\) is that it keeps researchers honest.
    • Adding variables always increases \(R^2\), but
    • \(\bar{R}^2\) can decrease because of the \(k\) term.
    • If the new variable doesn't reduce SSR enough to offset the lost degree of freedom, it will decrease \(\bar{R}^2\) (see the sketch after this list).
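
A minimal sketch of the adjustment, reusing fit3 from earlier (any fitted lm object would do): compute \(R^2\) and \(\bar{R}^2\) by hand and check them against summary().

y   <- model.response(model.frame(fit3))   # the dependent variable actually used in the fit
ssr <- sum(resid(fit3)^2)                  # sum of squared residuals
sst <- sum((y - mean(y))^2)                # total variation in y
n   <- length(y)
k   <- length(coef(fit3)) - 1              # number of slope coefficients
1 - ssr/sst                                # R-squared
1 - (ssr/(n - k - 1)) / (sst/(n - 1))      # adjusted R-squared
c(summary(fit3)$r.squared, summary(fit3)$adj.r.squared)   # should match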

Nested and non-nested models

  • Earlier we looked at fit3 and fit3.r. The latter was a special case of the former… it was fit3, but with specific values for some of the slope coefficients (specifically \(\beta_i = 0\))
    • (side note: run ?car::linearHypothesis for details about testing other restrictions… skip down to the examples for something that makes sense; a sketch follows after this list)
  • fit3.r was a nested model… it fits inside of fit3.
    • In these cases we can use F-tests (or compare \(\bar{R}^2\)) to choose the better model.
  • With non-nested models we can't always rely on these comparisons.
    • If both have the same \(y\) variable, then \(\bar{R}^2\) can be a reasonable guide in choosing the "better" model.
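
For reference, a sketch of the linearHypothesis() route mentioned in the side note (assuming the car package is installed); it reproduces the same joint F test as the anova() comparison of fit3 and fit3.r:

library(car)
linearHypothesis(fit3, c("atndrte = 0", "atndrte:priGPA = 0"))   # joint test of both attendance terms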

Residual analysis

  • You've assembled a dataset of houses for sale with their features (bedrooms, age, heating system, etc.), prices, etc. You run a regression which shows unsurprising results:
    • Valuable features like bathrooms and modern heating are associated with higher prices
    • Big houses in bad school districts are worth less.
    • etc.
  • Which house do you buy?

Residual analysis

  • Ceteris paribus you want to buy the one that's most underpriced. (Biggest bang for your buck.)
  • If you've controlled for all the features you care about
    • Decide what sort of house you want (e.g. 2 beds, 2 baths, old but with good plumbing, near a park, etc.)
    • Filter your list of houses down to the ones that meet your criteria
    • Compare their actual price to the price predicted by your model
    • The one with the largest negative residual is the one that, given its observable features, is most underpriced (see the sketch below).
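
A minimal sketch of that workflow using the hprice data from earlier (the filtering step is skipped, the model is deliberately simple, and it assumes no missing values so the residuals line up row-for-row):

fit.h <- lm(price ~ cbd + area, hprice)    # unscaled model, so residuals are in dollars
hprice %>%
  mutate(resid = resid(fit.h)) %>%         # actual price minus predicted price
  arrange(resid) %>%                       # most negative residuals first
  head()                                   # the most underpriced houses, given these controls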

Beware!

  • The weakness of this approach is that you're only able to take account of as many features as you've got data for.
  • Some relevant features might be difficult to get data for (e.g. does it have a sensible layout?)
  • Some features might be impossible to get data for (e.g. is it haunted?)

Residual analysis

  • Examples:
    • Finding the best deals (e.g. house, car, school, etc.)
    • Predicting economic growth
  • Takeaway: sometimes the residuals are an interesting variable in their own right.