Linear regression presentation

Linear regression

  • The model is y = beta0 + beta1*x + error
  • beta1 is the slope: a one-unit increase in x leads to a beta1 increase in y
  • beta0 is the intercept: when x is zero, y equals beta0
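
To make this concrete, a minimal sketch of fitting such a model in R (the simulated data and the name lmmodel are illustrative, not from the slides):

set.seed(42)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)   # true beta0 = 2, true beta1 = 3
lmmodel <- lm(y ~ x)
coef(lmmodel)                 # estimated intercept near 2, slope near 3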

LINEAR REGRESSION ASSUMPTIONS

  • Variable type: the outcome must be continuous; predictors can be continuous or dichotomous
  • Non-zero variance: predictors must not have zero variance
  • Independent errors: for any pair of observations, the error terms are uncorrelated
  • Linearity: the relationship between x and y is linear (check with a scatterplot)
  • No or little multicollinearity: predictors x1, x2, ..., xn must not be highly correlated (check with cor())
  • Homoscedasticity: for each value of the predictors, the variance of the error term should be constant
  • Normally distributed errors

**A note about sample size**

In linear regression, a common rule of thumb is that the analysis requires at least 20 to 30 cases/observations per independent variable.

ASSESSING MULTICOLLINEARITY

  • Correlation matrices: cor()

  • Variance Inflation Factor (VIF): the inverse of the tolerance statistic (higher values indicate that a predictor is redundant). It should be less than 10 (or, by a more lenient rule, 20). In R, vif(lmodel) is in the car package

  • Tolerance statistic: the % of variance in the independent variable that cannot be accounted for by the other predictors (smaller values indicate that a predictor is redundant). It should be higher than 0.1 (or, more conservatively, 0.2)
    library(car)                  # vif() lives in the car package
    vif_values <- vif(model)
    tolerance <- 1 / vif_values   # tolerance is the reciprocal of VIF
    tolerance
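
A small end-to-end sketch of these checks on simulated data (the predictors x1 and x2 and the model name are illustrative assumptions; vif() needs the car package):

library(car)
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)   # x2 is nearly a copy of x1: strong multicollinearity
y  <- 1 + x1 + x2 + rnorm(100)
model <- lm(y ~ x1 + x2)
cor(cbind(x1, x2))   # correlation matrix: x1 and x2 correlate near 1
vif(model)           # well above 10, flagging redundant predictors
1 / vif(model)       # tolerance well below 0.1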

ASSESSING MODEL FIT (OUTLIERS)

  • Standardized residuals:

    • 95% of standardized residuals should lie between +-2
    • 99% should lie between +-2.5
    • standardized residuals of 3 or more (in absolute value) are outliers
      rstandard(lmmodel)
  • Cook's distance: measures the influence of a single case on the model as a whole

    • values higher than 4/n may be cause for concern (both checks are combined in the sketch below):
      plot(lmmodel, which = 4, id.n = 5)
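
A sketch combining both outlier checks, assuming lmmodel is an already-fitted lm object:

rstd <- rstandard(lmmodel)
mean(abs(rstd) > 2)                  # should be roughly 0.05 if the model fits well
which(abs(rstd) >= 3)                # rows that count as outliers
cooks <- cooks.distance(lmmodel)
which(cooks > 4 / nobs(lmmodel))     # influential cases by the 4/n rule
plot(lmmodel, which = 4, id.n = 5)   # Cook's distance plot, labelling the 5 largest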

ASSESSING ASSUMPTIONS ABOUT ERRORS

  • Homoscedasticity/independence of errors:
    • residuals-versus-fitted plot: plots the standardized residuals against the standardized predicted values

plot(lmmodel, which = 1)   # which = 1 selects the residuals-vs-fitted plot

  • here we examine whether the errors are homoscedastic or heteroscedastic
  • we also check the linearity of the data: if the fitted line is curved, a linear model is not appropriate
  • this plot should look homoscedastic (a constant band of residuals)
  • the plot also reveals outliers: R labels their row numbers, and we then decide whether each one should be removed
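
To see what a violation looks like, a simulated heteroscedastic example (illustrative data, not from the slides); instead of a constant band, the plot shows a funnel:

set.seed(7)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = x)   # error spread grows with x: heteroscedastic
het_model <- lm(y ~ x)
plot(het_model, which = 1)            # residuals fan out rather than staying level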

Assumption: Normally distributed errors

residuals <- resid(lmmodel)

QQ plot

qqnorm(residuals)                # points should fall along a straight line if errors are normal
qqline(residuals, col = "red")

Shapiro-Wilk test

shapiro.test(residuals)   # p < 0.05 suggests the residuals deviate from normality

Linear regression output

confint(model, level = 0.90)   # 90% confidence intervals for the coefficients
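
For context, a short sketch of inspecting the rest of the standard output (model is assumed to be an already-fitted lm object):

summary(model)                # coefficient table with std. errors, t- and p-values, plus R2 and adjusted R2
summary(model)$coefficients   # the same coefficient table as a numeric matrix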

R-squared computation from residuals

  • R2 measures how much of the variance in the outcome is explained by the model
  • if it is 0.658, the model explains 65.8% of the variance
  • the greater the better?!
  • Not necessarily: adding more variables to the model always increases R2
  • so we also look at the adjusted R2, which penalizes extra predictors
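
A minimal sketch of that computation from the residuals, assuming model is a fitted lm object and y is its outcome variable:

ss_res <- sum(resid(model)^2)    # residual sum of squares
ss_tot <- sum((y - mean(y))^2)   # total sum of squares
1 - ss_res / ss_tot              # R2, should match summary(model)$r.squared
summary(model)$adj.r.squared     # adjusted R2, which penalizes extra predictors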

Example of a non-linear relationship

plot(x, y)

The correlation for this plot is zero, which is why we need to plot the data rather than just calculate the correlation.
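
A classic illustration of this (an assumed example, not from the slides): a perfect quadratic relationship whose linear correlation is zero.

x <- seq(-3, 3, by = 0.1)
y <- x^2      # y is fully determined by x, but not linearly
cor(x, y)     # essentially 0: correlation misses the pattern
plot(x, y)    # the scatterplot shows the parabola immediately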