October 25, 2016
What we saw before:
normalize <- function(vect){
  vbar <- mean(vect)
  z <- (vect - vbar)/sd(vect)
  return(z)
}
mtcars.n <- mtcars %>% mutate_each(funs(normalize))
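A quick sanity check (a sketch, not from the original slides): every column of mtcars.n should now have mean (roughly) 0 and standard deviation 1.
# Check that the normalized columns have mean ~0 and sd 1.
round(colMeans(mtcars.n), 10)
sapply(mtcars.n, sd)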
An alternative:
mtcars.n2 <- mtcars %>% mutate_each(funs(scale))
mtcars.n3 <- scale(mtcars)[1:32,]
Note: scale(mtcars) returns the scaled data as a matrix, with the mean and standard deviation of each variable attached as attributes (which print after the data). The [1:32,] addition tells R to keep only the data from that output (because mtcars has 32 rows of data).
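A short sketch (my addition) of what scale() actually hands back:
s <- scale(mtcars)
class(s)                  # a matrix, not a data frame
attr(s, "scaled:center")  # the column means
attr(s, "scaled:scale")   # the column standard deviations
head(as.data.frame(s))    # convert back to a data frame if you prefer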
Even better, we can use it in our regression (as long as we don't have too many variables).
mod <- lm(scale(mpg) ~ scale(hp) + scale(wt) + scale(disp),mtcars)
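As a sketch (my addition, using only mtcars), you can verify that these standardized ("beta") coefficients are just the raw coefficients rescaled by sd(x)/sd(y):
mod.raw <- lm(mpg ~ hp + wt + disp, mtcars)
# The beta coefficient on wt equals the raw coefficient times sd(wt)/sd(mpg).
coef(mod.raw)["wt"] * sd(mtcars$wt) / sd(mtcars$mpg)
coef(mod)["scale(wt)"]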
log(x) only works for \(x > 0\). log(x+1) can ameliorate this problem, but makes interpretation more difficult.
lm() takes different arguments and expects them in a particular order (which you can see by looking at the help file by running ?lm).
The first argument is the formula, written as LHS ~ RHS. We use ~ because = already means too many things in R. For example: qsec ~ as.character(cyl) + hp - 1.
The second argument is data. Our data is stored in an object with a name (in this case mtcars), so we have to tell lm() which is which: the formula is y ~ x1 + x2 and the data is in the thing called whatever.my.data.is.named.
fit1a <- lm(qsec ~ as.character(cyl) + hp - 1, mtcars)
fit1b <- lm(data = mtcars, formula = qsec ~ as.character(cyl) + hp - 1)
fit1c <- mtcars %>% lm(formula = qsec ~ as.character(cyl) + hp - 1)
fit1a %>% summary()
summary(fit1b)
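A small illustrative sketch (my addition): transformations such as log() can go straight into the formula. Here hp is strictly positive, so plain log() works; log1p() (i.e. log(x + 1)) is the fallback when zeros are possible.
fit.log <- lm(mpg ~ log(hp) + wt, mtcars)
summary(fit.log)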
load("hprice3.RData")
fit0 <- lm(scale(price) ~ scale(cbd) + scale(area), data)
fit1 <- lm(scale(price) ~ scale(cbd)*scale(area), data)
Call: lm(formula = scale(price) ~ scale(cbd) + scale(area), data = data)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.346e-17 4.232e-02 0.000 1.00000
scale(cbd) 1.120e-01 4.304e-02 2.602 0.00969 **
scale(area) 6.259e-01 4.304e-02 14.542 < 1e-04 ***
Residual standard error: 0.7583 on 318 degrees of freedom
Multiple R-squared: 0.4286, Adjusted R-squared: 0.425
F-statistic: 119.3 on 2 and 318 DF, p-value: < 1e-04
Call: lm(formula = scale(price) ~ scale(cbd) * scale(area), data = data)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01126 0.04313 -0.261 0.79425
scale(cbd) 0.12849 0.04478 2.869 0.00439 **
scale(area) 0.63728 0.04385 14.532 < 1e-04 ***
scale(cbd):scale(area) 0.06509 0.04948 1.316 0.18929
Residual standard error: 0.7574 on 317 degrees of freedom
Multiple R-squared: 0.4317, Adjusted R-squared: 0.4264
F-statistic: 80.28 on 3 and 317 DF, p-value: < 1e-04
The positive scale(cbd):scale(area) term (the "interaction effect") implies that larger houses are worth even more when they're closer to the central business district. (Mind you, the effect is small and statistically insignificant.)
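As a rough sketch (my addition), the implied effect of scale(cbd) for a house one standard deviation above average size is the main effect plus the interaction:
# Effect of scale(cbd) when scale(area) = 1 (i.e. one sd above the mean).
coef(fit1)["scale(cbd)"] + coef(fit1)["scale(cbd):scale(area)"] * 1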
\[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2+ \beta_3 x_1 x_2 + u\]
What is the effect (on \(y\)) of increasing \(x_1\) by one unit?
\[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + u\]
What is the effect (on \(y\)) of increasing \(x_1\) by one unit?
\(\beta_1 + \beta_3 x_2\).
In effect, \(\beta_1\) is the effect of \(x_1\) when \(x_2=0\).
\[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + u\]
By centering (demeaning) the data, we can interpret \(\beta_1\) as the effect of \(x_1\) when \(x_2 = \bar{x_2}\). If you want regular coefficients (i.e. not beta coefficients) you can use the scale() function with the option scale set to FALSE (all caps, or 'F' for short).
fit2 <- lm(scale(price, scale = F) ~ scale(cbd, scale = F)*scale(area, scale = F), data)
hprice <- data
hprice.desc <- desc
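A sketch (my addition): centering moves the intercept and the lower-order coefficients, but the interaction coefficient itself is unchanged. Comparing fit2 to an uncentered fit makes this visible.
fit2.raw <- lm(price ~ cbd*area, data)
# The last coefficient in each fit is the interaction term; the two should match.
tail(coef(fit2), 1)
tail(coef(fit2.raw), 1)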
load("attend.RData")
attend <- data
attend <- data %>% mutate(ACT2 = ACT^2,
priGPA2 = priGPA^2)
fit3 <- lm(stndfnl ~ atndrte*priGPA + ACT + ACT2 + priGPA2, attend)
Call: lm(formula = stndfnl ~ atndrte * priGPA + ACT + ACT2 + priGPA2,
data = attend)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.050293 1.360319 1.507 0.132225
atndrte -0.006713 0.010232 -0.656 0.512005
priGPA -1.628540 0.481003 -3.386 0.000751 ***
ACT -0.128039 0.098492 -1.300 0.194047
ACT2 0.004533 0.002176 2.083 0.037634 *
priGPA2 0.295905 0.101049 2.928 0.003523 **
atndrte:priGPA 0.005586 0.004317 1.294 0.196173
Residual standard error: 0.8729 on 673 degrees of freedom
Multiple R-squared: 0.2287, Adjusted R-squared: 0.2218
F-statistic: 33.25 on 6 and 673 DF, p-value: < 1e-04
\(\beta_{atndrte} = -0.006713\), implying that lower attendance rates are good for final exam scores. But \(\beta_{atndrte,priGPA} = 0.005586\), which tells us that for a prior GPA above about 1.2 there is a positive overall effect of attendance.
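A sketch (my addition) of where that 1.2 comes from: the overall effect of atndrte is \(\beta_{atndrte} + \beta_{atndrte,priGPA} \cdot priGPA\), which switches sign at \(-\beta_{atndrte}/\beta_{atndrte,priGPA}\).
b <- coef(fit3)
-b["atndrte"] / b["atndrte:priGPA"]   # roughly 1.2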
The coefficients are statistically insignificant, but that's probably just due to multicollinearity (since both terms include atndrte).
To get a better idea of what's going on, we should use an F test. Here we're restricting \(\beta_{atndrte}\) and \(\beta_{atndrte,priGPA}\) to both equal 0 (which we do by leaving them out of the restricted model).
fit3.r <- lm(stndfnl ~ priGPA + ACT + ACT2 + priGPA2, attend)
print(anova(fit3, fit3.r))
Analysis of Variance Table
Model 1: stndfnl ~ atndrte * priGPA + ACT + ACT2 + priGPA2
Model 2: stndfnl ~ priGPA + ACT + ACT2 + priGPA2
  Res.Df    RSS Df Sum of Sq     F  Pr(>F)
1    673 512.76
2    675 519.34 -2   -6.5814 4.319 0.01368 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Jointly, these variables are statistically significant with a p-value of 0.014!
Even though individually they're statistically insignificant, we shouldn't be so ready to leave them out of the regression!
data.dm <- attend %>% mutate_each(funs(scale(., scale = FALSE)))
fit3.beta <- lm(stndfnl ~ atndrte*priGPA + ACT + ACT2 + priGPA2, data.dm)
print(summary(fit3.beta), concise = TRUE)
Call: lm(formula = stndfnl ~ atndrte * priGPA + ACT + ACT2 + priGPA2,
data = data.dm)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.022125 0.037588 -0.589 0.55631
atndrte 0.007737 0.002633 2.938 0.00341 **
priGPA -1.172118 0.529975 -2.212 0.02733 *
ACT -0.128039 0.098492 -1.300 0.19405
ACT2 0.004533 0.002176 2.083 0.03763 *
priGPA2 0.295905 0.101049 2.928 0.00352 **
atndrte:priGPA 0.005586 0.004317 1.294 0.19617
Residual standard error: 0.8729 on 673 degrees of freedom
Multiple R-squared: 0.2287, Adjusted R-squared: 0.2218
F-statistic: 33.25 on 6 and 673 DF, p-value: < 1e-04
When we demean the data, our results change, but only for the coefficients involved in the interaction term. And the change makes interpretation easier.
For comparison, here is the original (non-demeaned) output again:
Call: lm(formula = stndfnl ~ atndrte * priGPA + ACT + ACT2 + priGPA2,
data = attend)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.050293 1.360319 1.507 0.132225
atndrte -0.006713 0.010232 -0.656 0.512005
priGPA -1.628540 0.481003 -3.386 0.000751 ***
ACT -0.128039 0.098492 -1.300 0.194047
ACT2 0.004533 0.002176 2.083 0.037634 *
priGPA2 0.295905 0.101049 2.928 0.003523 **
atndrte:priGPA 0.005586 0.004317 1.294 0.196173
Residual standard error: 0.8729 on 673 degrees of freedom
Multiple R-squared: 0.2287, Adjusted R-squared: 0.2218
F-statistic: 33.25 on 6 and 673 DF, p-value: < 1e-04
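A sketch (my addition) of why only those coefficients move: the demeaned atndrte coefficient is just the original one evaluated at the mean prior GPA (and likewise for priGPA at the mean attendance rate).
# b_atndrte + b_interaction * mean(priGPA) reproduces the demeaned coefficient.
coef(fit3)["atndrte"] + coef(fit3)["atndrte:priGPA"] * mean(attend$priGPA)
coef(fit3.beta)["atndrte"]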
Increasing \(R^2\) is easy… just over-fit. Add any variable you can find and throw it into your regression.
This means \(R^2\) is only a good measure of goodness of fit when we're dealing with honest and alert researchers (who here didn't do their last homework assignment in the middle of the night?).
Recall:
\[R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\]
\[R^2 = 1 - \frac{SSR/n}{SST/n}\]
Dividing numerator/denominator by \(n\) doesn't change anything, but makes it look a bit more like the population variance.
For adjusted \(R^2\) we want to do something similar to what we do when estimating the sample variance.
Since we're estimating \(k + 1\) coefficients (each \(x\) variable and the intercept term) we should adjust our sum of squared residuals (SSR) accordingly.
For total variation (SST) we'll follow the example of sample variance and adjust by 1 (since SST is relative to an estimated sample mean… i.e. there's one parameter being estimated).
\[\bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)}\]
Strictly speaking, this doesn't get us closer to knowing anything about the true/underlying/population variation.
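A sketch (my addition): computing \(\bar{R}^2\) by hand for fit3 and checking it against summary().
n <- nobs(fit3)
k <- length(coef(fit3)) - 1             # number of slope coefficients
y <- model.response(model.frame(fit3))  # the y actually used in the fit
SSR <- sum(resid(fit3)^2)
SST <- sum((y - mean(y))^2)
1 - (SSR/(n - k - 1)) / (SST/(n - 1))
summary(fit3)$adj.r.squared             # should match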
Recall fit3 and fit3.r. The latter was a special case of the former… it was fit3, but with specific values for some of the slope coefficients (specifically \(\beta_i = 0\)). In other words, fit3.r was a nested model… it fits inside of fit3. (See ?car::linearHypothesis for details about testing other restrictions… skip down to the examples for something that makes sense.)
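A sketch (my addition), assuming the car package is installed: the same joint test run through linearHypothesis() instead of fitting the restricted model by hand.
library(car)
# Jointly test that both attendance-related coefficients are zero.
linearHypothesis(fit3, c("atndrte = 0", "atndrte:priGPA = 0"))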