Remember:
- A fitted value is an estimate of the dependent variable
- Residuals are the differences between the actual and fitted values of
the dependent variable
- Homoskedasticity - the variability of the errors is constant
- Heteroskedasticity - the variability of the errors isn’t constant
- t-value formula: t = (b - beta) / se(b)
- Adjusted R-squared is used only for
comparison purposes
- Interactions: whether the regression lines are parallel is checked only
visually. We assume a visual check is enough to be sure
- The coefficient of the interaction shows the difference in slopes
(form: b3 * X * Gender)
- Interaction in the model’s output: if an interaction is included in the
model, the line of the numeric variable that interacts with the dummy
variable shows the regression coefficient for the reference category of
the dummy variable. E.g., with an interaction between weight and gender,
the weight line gives the effect of weight for the reference category of
gender (apart from that, the coefficient is interpreted as usual)
- There are only two reasons for log-transforming a variable. Either we
want to examine the effect of a RELATIVE increase/decrease in an
explanatory variable on the dependent one, or the relationship between
the dependent variable and the explanatory variable being
log-transformed is not linear.
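The first definitions above (fitted value, residual, t-value) can be sketched for the simplest case of one explanatory variable. This is only an illustration in plain Python rather than the statistics package the notes rely on. The data are made up, and the names `simple_ols` and `t_value` are hypothetical.

```python
import math

def simple_ols(x, y):
    """Fit y = b0 + b1*x by ordinary least squares (one explanatory variable)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Fitted value: the model's estimate of the dependent variable.
    fitted = [b0 + b1 * xi for xi in x]
    # Residual: actual value minus fitted value.
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    # Residual variance with n - 2 degrees of freedom (two estimated parameters).
    s2 = sum(e ** 2 for e in residuals) / (n - 2)
    se_b1 = math.sqrt(s2 / sxx)
    return {"b0": b0, "b1": b1, "fitted": fitted,
            "residuals": residuals, "se_b1": se_b1}

def t_value(b, beta_h0, se):
    """t = (b - beta) / se(b), with beta the value stated in H0 (usually 0)."""
    return (b - beta_h0) / se

# Made-up illustration data (not from the notes).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = simple_ols(x, y)
t_b1 = t_value(fit["b1"], 0.0, fit["se_b1"])
```

With an intercept in the model, the residuals sum to zero (up to rounding), which is a quick sanity check on any fit.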
Assumptions
- Linearity in parameters (linear relationship between Y and Xj)
- Remember: no dummy variables
- Violation:
- Solution:
- Run the OLS regression at this point
- Make a scatterplot of the standardized residuals against the
standardized fitted values (to check for large standardized residuals).
Run the corresponding code to create the fitted values first
- Transform the problematic explanatory variables with a logarithm
- Create the scatterplot matrix again
- The expected value of the errors equals 0
- Remember: this means no variables with considerable influence on the
dependent variable have been left out (in practice, check previous
studies).
- Violation: the test of the partial regression coefficient is not
valid because the estimated coefficient is biased (see the formula for
the t-value used in that test)
- Homoscedasticity of errors
- Remember: linearity between the dependent and explanatory variables
can be seen here too (not important)
- Violation: the standard errors in the t-value formula are biased.
- Solution: White’s robust standard errors.
- Normal distribution of errors
- Remember: since we don’t have the population data, we work with
residuals. So any decision we make about the errors is possible only
through hypothesis testing
- Violation: in this case the test of the partial regression
coefficients isn’t possible
- Solution: it’s not problematic if the sample is big enough
- Errors are independent
- Remember: Cov(Ei,Ej) = 0.
- Violation: the standard errors in the t-value formula are biased.
- Solution: multilevel regression, econometric methods or hierarchical
regression (not covered here)
- No perfect multicollinearity
- Violation: Parameters of the regression model cannot be
estimated.
- Solution: Exclude one of the explanatory variables
- The number of units (sample size) is bigger than number of
estimated parameters n > k
- Violation: parameters cannot be estimated
!If these assumptions are met, the Gauss-Markov theorem states that the
OLS estimator is the Best Linear Unbiased Estimator (BLUE)
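As an illustration of the White's robust standard errors named as the solution for heteroscedasticity, the following sketch compares the classical and the robust standard error of the slope in a one-variable model. It assumes the simplest variant (often labelled HC0); the notes do not say which variant their software uses. The data are made up and the function name is hypothetical.

```python
import math

def slope_standard_errors(x, y):
    """Return (classical_se, white_hc0_se) for the slope of y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [xi - mean_x for xi in x]
    sxx = sum(d ** 2 for d in dx)
    b1 = sum(d * (yi - mean_y) for d, yi in zip(dx, y)) / sxx
    b0 = mean_y - b1 * mean_x
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

    # Classical SE: assumes one common error variance (homoscedasticity).
    s2 = sum(e ** 2 for e in residuals) / (n - 2)
    classical_se = math.sqrt(s2 / sxx)

    # White (HC0) SE: each unit contributes its own squared residual,
    # so no common error variance is assumed.
    hc0_var = sum((d ** 2) * (e ** 2) for d, e in zip(dx, residuals)) / sxx ** 2
    return classical_se, math.sqrt(hc0_var)

# Made-up illustration data (not from the notes).
x = [0.0, 1.0, 2.0]
y = [0.0, 1.0, 1.0]
classical_se, robust_se = slope_standard_errors(x, y)
```

The estimated slope is the same in both cases. Only the standard error, and therefore the t-value, changes.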
Ordinary Least Squares method (OLS)
Additional requirements
- Dependent variable is numeric; explanatory variables are
either numeric or dummy
- Each explanatory variable varies, and it’s desirable that these
variables possess the widest possible range of variability
- Outliers and units with big impact are excluded, based on the
standardized residuals
- Remember: an outlier is a value of the dependent variable with a
big residual. A unit with big impact changes the slope a lot.
- Violation: drop the violating units. Re-run the OLS regression if
any observations were excluded. Regarding the dropping process: drop
units either by ID or by, e.g., Cook’s distance greater than…
- No overly strong multicollinearity
- Violation: inflated standard errors can make the model’s
coefficients impossible to interpret.
- Solution: drop the variable
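Two of the checks in this list can be computed by hand in small cases. The sketch below shows Cook's distance for a one-variable model and the variance inflation factor (VIF) for a pair of explanatory variables. The VIF is a common diagnostic for multicollinearity that the notes do not name, so it is an assumption here. No cut-off values are given because the notes leave them open. All names and data are hypothetical.

```python
import math

def cooks_distances(x, y):
    """Cook's distance of every unit in the model y = b0 + b1*x."""
    n = len(x)
    p = 2  # estimated parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    b0 = mean_y - b1 * mean_x
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)
    distances = []
    for xi, e in zip(x, residuals):
        # Leverage: how unusual the unit's x value is.
        h = 1 / n + (xi - mean_x) ** 2 / sxx
        distances.append(e ** 2 / (p * s2) * h / (1 - h) ** 2)
    return distances

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - r^2) for a model with exactly two explanatory variables."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    r2 = s12 ** 2 / (s11 * s22)
    if 1 - r2 < 1e-12:
        return math.inf  # perfect multicollinearity: no estimate possible
    return 1 / (1 - r2)

# Made-up illustration data (not from the notes).
distances = cooks_distances([0.0, 1.0, 2.0], [0.0, 1.0, 1.0])
```

A unit can have a large Cook's distance even with a modest residual if its leverage is high, which matches the distinction the notes draw between outliers and units with big impact.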
Analysis
- Run the summary of the regression and get an
output
- Check the p-value.
- The corresponding test is called ‘test of partial regression
coefficient’
- Interpret the results
- Check the estimates in the middle.
- Complete the (estimated) regression function with its help.
- Interpret the function.
- Pay attention to the coefficient of determination (R-squared) below.
- If we have only one explanatory variable, it’s simple R-squared. If
two or more, it is multiple R-squared. Its value varies between 0 and
+1. It shows what percentage of the variability of the dependent
variable is explained by the linear effect of the explanatory variables.
Interpret the value.
- Pay attention to the last line of the output with the
F-Test.
- It is the test of significance of the regression model
(H0: rho^2 = 0, H1: rho^2 > 0).
*rho^2 is the population coefficient of determination
! Regarding the last two points: the two results fit together. If the
coefficient of determination is quite high, the model explains much of
the variability of the dependent variable, and the F-test then usually
rejects H0 in favour of rho^2 > 0 (H1), which means the model is, so to
speak, good and appropriate. On the other hand, if the coefficient of
determination is low, we may be unable to reject rho^2 = 0; the model
then barely explains anything, so it isn’t significant. R^2 refers to
the sample, whereas rho^2 refers to the population. Importantly, for
rho^2 we cannot tell which proportion of the dependent variable’s
variability is explained by the model; we can only say that rho^2 is
positive.
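The link between R-squared and the F-test can be made concrete with the usual formulas. In this sketch k is taken to be the number of estimated parameters including the intercept, as in the assumption n > k above. The function names are hypothetical, and the p-value is left out because it needs the F distribution, which the standard library does not provide.

```python
def f_statistic(r2, n, k):
    """F statistic for H0: population coefficient of determination = 0.

    r2: sample R-squared, n: sample size,
    k: number of estimated parameters including the intercept.
    """
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))

def adjusted_r2(r2, n, k):
    """Adjusted R-squared, used only for comparing models."""
    return 1 - (1 - r2) * (n - 1) / (n - k)

# Made-up numbers: R-squared 0.5 from 12 units, intercept plus one slope.
f_value = f_statistic(0.5, 12, 2)
adj = adjusted_r2(0.5, 12, 2)
```

For a fixed sample size and number of parameters, a higher R-squared always gives a larger F statistic, which is why a high coefficient of determination and a significant model tend to go together.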
If different models are to be compared
Post-analysis
- Look at the multiple correlation coefficient (the square root of
R-squared; ‘multiple’ because it involves all explanatory variables)! We
never talk about positive or negative correlation when interpreting the
multiple correlation.
- It tells how strong the relationship is between all explanatory
variables and the dependent one.
- Check the standardized partial regression coefficients (lm.beta()). We
need to standardize them because the variables are measured in different
units (e.g. one is a proportion, another a percentage).
- The higher the absolute value, the bigger is the impact of the
variable
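Both post-analysis quantities follow from values already in the regression output. The sketch below uses the textbook formula for a standardized coefficient (the coefficient multiplied by the ratio of standard deviations). It is an assumption that this matches what lm.beta() reports, and the function names are hypothetical.

```python
import math
import statistics

def standardized_coefficient(b, x, y):
    """Standardized coefficient: b * sd(x) / sd(y), free of measurement units."""
    return b * statistics.stdev(x) / statistics.stdev(y)

def multiple_correlation(r2):
    """Multiple correlation coefficient: the square root of R-squared (never negative)."""
    return math.sqrt(r2)

# Made-up illustration data: y is exactly 2 * x, so the raw slope is 2.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
beta_std = standardized_coefficient(2.0, x, y)
multiple_r = multiple_correlation(0.64)
```

Because each standardized coefficient is expressed in standard deviations, their absolute values can be compared across variables measured in different units.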
2nd regression analysis
- Run full OLS regression and summary of it.
Part of comparison of both models and conclusion
When making conclusions:
- Don’t forget to mention ‘on
average’ and ‘assuming all other
variables remain unchanged’ (ceteris paribus condition)
- If the explanatory variable is measured in %, then in the
interpretation speak of an increase by 1 percentage point (pp)
- When the explanatory variable is dichotomous, don’t forget to
mention ‘in comparison to {another
category}’
- Additionally, when the explanatory variable is dichotomous, the
ceteris paribus part is as follows: given the
value(s) of {all other explanatory variables}
- lin-log model: assuming all other variables remain unchanged, if
{explanatory variable} increases by 1%, the {dependent variable}
increases on average by b/100
- coefficient of determination R^2: if the influencing explanatory
variables need to be named, we mention them in the form in which they
enter the model. So this also applies to, e.g., ‘natural logarithm of
{explanatory variable}’
- Interactions:
- num.exp.var. alone: assuming all other variables remain unchanged, if
the {num.exp.var.} for {ref. cat. of dummy var.} increases by 1, the
{dep.var.} increases on average by ‘b’
- Interaction of num.exp.var. and category of dummy var.: assuming all
other variables remain unchanged, if the {num.exp.var.} for {category of
dummy var.} increases by 1, the {dep.var.} increases on average by ‘b’
MORE than for {ref.cat}