Linear regression presentation

Linear regression

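  • Model: y = beta1 + beta2*x + error
  • beta2 is the slope: a one-unit increase in x changes the expected value of y by beta2
  • beta1 is the intercept: the expected value of y when x is zero (and for every x if beta2 is zero)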
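
A minimal sketch of reading the slope and intercept from a fitted model. The data are simulated for illustration only (true intercept 3, true slope 2), and the names x, y, lmmodel, beta1 and beta2 are assumptions made for this example.

```r
# Simulated data: the true intercept (beta1) is 3 and the true slope (beta2) is 2.
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 3 + 2 * x + rnorm(50, sd = 0.5)

lmmodel <- lm(y ~ x)
coefs <- coef(lmmodel)            # named vector: "(Intercept)" and "x"

beta1 <- coefs[["(Intercept)"]]   # estimated intercept
beta2 <- coefs[["x"]]             # estimated slope

# Two predictions one unit apart in x differ by exactly the estimated slope.
pred <- predict(lmmodel, newdata = data.frame(x = c(4, 5)))
slope_check <- unname(diff(pred))

c(beta1 = beta1, beta2 = beta2, slope_check = slope_check)
```

The estimates should land close to 3 and 2, and slope_check matches beta2 because the fitted line rises by beta2 for every unit of x.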

LINEAR REGRESSION ASSUMPTIONS

  • Variable type: outcome must be continuous. Predictors can be continuous or dichotomous
  • Non-zero variance: Predictors MUST NOT HAVE zero variance
  • Independent errors: For any pair of observations, the error terms are uncorrelated
  • Linear relationship between x and y (check with scatterplot)
  • No or little multicollinearity: Predictors x1, x2, ..., xn must not be highly correlated with each other (check with cor())
  • Homoscedasticity: For each value of the predictors the variance of the error term should be constant
  • Normally distributed errors

A note about sample size:

As a rule of thumb, linear regression requires at least 20 to 30 cases (observations) per independent variable in the analysis.

ASSESSING MULTICOLLINEARITY

  • Correlation matrices: cor()
  • Tolerance statistic: the proportion of variance in a predictor that cannot be accounted for by the other predictors (smaller values indicate that the predictor is redundant). Values below 0.1 or 0.2 are a cause for concern
  • Variance Inflation Factor (VIF): the inverse of the tolerance statistic (higher values indicate that a predictor is redundant). It should be below 10 (or, more strictly, below 5, which corresponds to a tolerance of 0.2). In R, vif(lmmodel) is available in the car package
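
To make the link between tolerance and VIF concrete, here is a rough sketch that computes both by hand in base R instead of calling the car package. The data are simulated, and the predictor names x1, x2, x3 are hypothetical; x2 is deliberately built from x1 so that the two are collinear.

```r
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # deliberately correlated with x1
x3 <- rnorm(n)                  # unrelated to the other two
predictors <- data.frame(x1, x2, x3)

# Step 1: inspect the correlation matrix.
round(cor(predictors), 2)

# Step 2: tolerance of a predictor = 1 - R^2 from regressing it
# on all the other predictors.
tolerance <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 - summary(fit)$r.squared
})

# Step 3: VIF is the inverse of the tolerance.
vif_values <- 1 / tolerance

round(rbind(tolerance, VIF = vif_values), 2)
```

In this setup x1 and x2 should show low tolerance and high VIF, while x3 should stay close to a VIF of 1.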

ASSESSING MODEL FIT (OUTLIERS)

  • Standardized residuals:
    • about 95% of standardized residuals should lie between -2 and +2
    • about 99% should lie between -2.5 and +2.5
    • cases with a standardized residual of 3 or more in absolute value are likely outliers
  • Cook's distance: measures the influence of a single case on the model as a whole
    • values higher than 4/n may be cause for concern
  • Cook's distance in R: plot(lmmodel, which = 4, id.n = 5)
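
The checks above can also be done numerically. This is a sketch on simulated data, with one outlier planted on purpose in row 100; the variable names are assumptions made for the example.

```r
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[100] <- y[100] + 10              # plant one obvious outlier in the last row
lmmodel <- lm(y ~ x)

std_res <- rstandard(lmmodel)      # standardized residuals
cooks   <- cooks.distance(lmmodel) # Cook's distance for every case

share_within_2 <- mean(abs(std_res) <= 2)     # expect roughly 0.95
outlier_rows   <- which(abs(std_res) >= 3)    # rows flagged as outliers
concern_rows   <- which(cooks > 4 / n)        # rows above the 4/n rule of thumb

share_within_2
outlier_rows
concern_rows
```

Row 100 should appear in both outlier_rows and concern_rows. Other rows may cross the 4/n threshold too, since it is only a rule of thumb.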

ASSESSING ASSUMPTIONS ABOUT ERRORS

  • Homoscedasticity/independence of errors:
    • residuals vs fitted plot: plots the residuals against the fitted (predicted) values

plot(lmmodel, which = 1)   # which = 1 selects the residuals vs fitted plot

  • here we check for homoscedasticity or heteroscedasticity
  • we also check linearity: if the line is curved, the relationship is not linear
  • the plot should look homoscedastic (a roughly constant spread of residuals across fitted values)
  • the plot also shows outliers: it labels them with their row number, and we then decide whether to remove them
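
As an illustration of what a heteroscedastic pattern looks like, the sketch below simulates data whose error spread grows with x, draws the residuals vs fitted plot, and adds a rough numeric comparison. The split at the median of the fitted values is an informal check chosen for this example, not a formal test.

```r
set.seed(4)
n <- 200
x <- runif(n, 1, 10)
y <- 2 * x + rnorm(n, sd = 0.5 * x)   # error spread grows with x
lmmodel <- lm(y ~ x)

plot(lmmodel, which = 1)              # look for a funnel shape

# Rough numeric check: residual spread for low vs high fitted values.
low     <- fitted(lmmodel) <= median(fitted(lmmodel))
sd_low  <- sd(resid(lmmodel)[low])
sd_high <- sd(resid(lmmodel)[!low])

c(low = sd_low, high = sd_high)
```

A clearly larger spread in the high group than in the low group matches the funnel shape in the plot and points to heteroscedasticity.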

Linear regression output

R-squared computation (residuals)

  • R2 is the proportion of variance in the outcome that is explained by the model
  • if it is 0.658, the model explains 65.8% of the variance
  • is a greater R2 always better?
  • not necessarily: R2 never decreases when we add more variables to the model, even if they are irrelevant
  • so we look at adjusted R2, which penalizes additional predictors
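
A small sketch of why R2 alone can mislead. It fits one model with a single real predictor and a second model that adds ten pure-noise columns. The data and all names (dat, small, big) are made up for this example.

```r
set.seed(5)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Ten extra columns of pure noise; data.frame() names them X1..X10.
dat <- data.frame(y = y, x = x, matrix(rnorm(n * 10), nrow = n))

small <- lm(y ~ x, data = dat)   # the real predictor only
big   <- lm(y ~ ., data = dat)   # the real predictor plus the noise columns

r2  <- c(small = summary(small)$r.squared,
         big   = summary(big)$r.squared)
adj <- c(small = summary(small)$adj.r.squared,
         big   = summary(big)$adj.r.squared)

round(rbind(R2 = r2, adjusted_R2 = adj), 3)
```

R2 for the big model cannot be lower than for the small one, even though the added columns carry no information. Adjusted R2 is always at or below R2 and typically gains far less, or even drops, when useless predictors are added.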

Example of a non-linear relationship

plot(x, y)

The correlation for this plot is zero even though x and y are clearly related. This is why we need to plot the data and not just calculate the correlation.
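
The original data behind this plot are not given, so the following is only an assumed example of the same effect: a perfect quadratic relationship on a symmetric range of x, where the linear correlation comes out as essentially zero.

```r
x <- -10:10
y <- x^2             # y is completely determined by x, but not linearly

r <- cor(x, y)       # Pearson correlation measures linear association only
r

plot(x, y)           # the plot shows the U-shaped relationship at once
```

Here cor() reports a value that is zero up to floating-point error, while the plot makes the dependence obvious.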