Linear regression presentation

Linear regression

  • The model is y = beta0 + beta1*x + error
  • beta1 is the slope: a one-unit increase in x leads to a beta1 increase in y
  • beta0 is the intercept: when x is zero, y equals beta0
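
To make this concrete, a minimal sketch of fitting such a model in R (the simulated data and the name lmmodel are illustrative, not from the slides):

set.seed(42)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)   # true beta0 = 2, true beta1 = 3
lmmodel <- lm(y ~ x)
coef(lmmodel)                 # estimated intercept near 2, slope near 3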

LINEAR REGRESSION ASSUMPTIONS

  • Variable type: the outcome must be continuous; predictors can be continuous or dichotomous
  • Non-zero variance: predictors must not have zero variance
  • Independent errors: for any pair of observations, the error terms are uncorrelated
  • Linearity: the relationship between x and y is linear (check with a scatterplot)
  • No or little multicollinearity: predictors x1, x2, ..., xn must not be highly correlated (check with cor())
  • Homoscedasticity: for each value of the predictors, the variance of the error term should be constant
  • Normally distributed errors

**A note about sample size**

In linear regression, a common rule of thumb is that the analysis requires at least 20 to 30 cases/observations per independent variable.

ASSESSING MULTICOLLINEARITY

  • Correlation matrices: cor()

  • Variance Inflation Factor (VIF): the inverse of the tolerance statistic (higher values indicate that a predictor is redundant). It should be less than 10 (or, by a more lenient rule, 20). In R, vif(lmodel) is in the car package

  • Tolerance statistic: the % of variance in the independent variable that cannot be accounted for by the other predictors (smaller values indicate that a predictor is redundant). It should be higher than 0.1 (or, more conservatively, 0.2)
    library(car)                  # vif() lives in the car package
    vif_values <- vif(model)
    tolerance <- 1 / vif_values   # tolerance is the reciprocal of VIF
    tolerance
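
A small end-to-end sketch of these checks on simulated data (the predictors x1 and x2 and the model name are illustrative assumptions; vif() needs the car package):

library(car)
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)   # x2 is nearly a copy of x1: strong multicollinearity
y  <- 1 + x1 + x2 + rnorm(100)
model <- lm(y ~ x1 + x2)
cor(cbind(x1, x2))   # correlation matrix: x1 and x2 correlate near 1
vif(model)           # well above 10, flagging redundant predictors
1 / vif(model)       # tolerance well below 0.1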

ASSESSING MODEL FIT (OUTLIERS)

  • Standardized residuals:

    • 95% of standardized residuals should lie between +-2
    • 99% should lie between +-2.5
    • standardized residuals of 3 or more (in absolute value) are outliers
      rstandard(lmmodel)
  • Cook's distance: measures the influence of a single case on the model as a whole

    • values higher than 4/n may be cause for concern (both checks are combined in the sketch below):
      plot(lmmodel, which = 4, id.n = 5)
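
A sketch combining both outlier checks, assuming lmmodel is an already-fitted lm object:

rstd <- rstandard(lmmodel)
mean(abs(rstd) > 2)                  # should be roughly 0.05 if the model fits well
which(abs(rstd) >= 3)                # rows that count as outliers
cooks <- cooks.distance(lmmodel)
which(cooks > 4 / nobs(lmmodel))     # influential cases by the 4/n rule
plot(lmmodel, which = 4, id.n = 5)   # Cook's distance plot, labelling the 5 largest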

ASSESSING ASSUMPTIONS ABOUT ERRORS

  • Homoscedasticity/independence of errors:
    • residuals-versus-fitted plot: plots the standardized residuals against the standardized predicted values

plot(lmmodel, which = 1)   # which = 1 selects the residuals-vs-fitted plot

  • here we examine whether the errors are homoscedastic or heteroscedastic
  • we also check the linearity of the data: if the fitted line is curved, a linear model is not appropriate
  • this plot should look homoscedastic (a constant band of residuals)
  • the plot also reveals outliers: R labels their row numbers, and we then decide whether each one should be removed
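
To see what a violation looks like, a simulated heteroscedastic example (illustrative data, not from the slides); instead of a constant band, the plot shows a funnel:

set.seed(7)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = x)   # error spread grows with x: heteroscedastic
het_model <- lm(y ~ x)
plot(het_model, which = 1)            # residuals fan out rather than staying level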

Assumption: Normally distributed errors

residuals <- resid(lmmodel)

QQ plot

qqnorm(residuals)                # points should fall along a straight line if errors are normal
qqline(residuals, col = "red")

Shapiro-Wilk test

shapiro.test(residuals)   # p < 0.05 suggests the residuals deviate from normality

Linear regression output

confint(model, level = 0.90)   # 90% confidence intervals for the coefficients
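
For context, a short sketch of inspecting the rest of the standard output (model is assumed to be an already-fitted lm object):

summary(model)                # coefficient table with std. errors, t- and p-values, plus R2 and adjusted R2
summary(model)$coefficients   # the same coefficient table as a numeric matrix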

R-squared computation from residuals

  • R2 measures how much of the variance in the outcome is explained by the model
  • if it is 0.658, the model explains 65.8% of the variance
  • the greater the better?!
  • Not necessarily: adding more variables to the model always increases R2
  • so we also look at the adjusted R2, which penalizes extra predictors
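
A minimal sketch of that computation from the residuals, assuming model is a fitted lm object and y is its outcome variable:

ss_res <- sum(resid(model)^2)    # residual sum of squares
ss_tot <- sum((y - mean(y))^2)   # total sum of squares
1 - ss_res / ss_tot              # R2, should match summary(model)$r.squared
summary(model)$adj.r.squared     # adjusted R2, which penalizes extra predictors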

Example of a non-linear relationship

plot(x, y)

The correlation for this plot is zero, which is why we need to plot the data rather than just calculate the correlation.
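
A classic illustration of this (an assumed example, not from the slides): a perfect quadratic relationship whose linear correlation is zero.

x <- seq(-3, 3, by = 0.1)
y <- x^2      # y is fully determined by x, but not linearly
cor(x, y)     # essentially 0: correlation misses the pattern
plot(x, y)    # the scatterplot shows the parabola immediately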