Remember:

Fitted value is an estimate of the dependent variable
Residuals are the differences between the actual and fitted values of the dependent variable
Homoskedasticity - variability is constant
Heteroskedasticity - variability isn’t constant
t-value formula: t = (b - beta) / se(b), where se(b) is the standard error of b
Adjusted R-squared is used only for comparison purposes
Interactions: whether the regression lines are parallel is checked only visually. We assume a visual check is enough
Coefficient of interaction shows the difference in slopes (form: b3XGender)
Interaction in the model’s output: if an interaction is included in the model, the numeric variable that interacts with the dummy variable still has its own line in the output. The coefficient on that line describes its effect for the reference category of the dummy variable. E.g., with an interaction between weight and gender, the coefficient of weight on its own line is its effect for the reference category of gender (otherwise the coefficient is interpreted as usual)
There are only two reasons to log-transform a variable: either we want to estimate the effect of a RELATIVE increase/decrease in an explanatory variable on the dependent variable, or the relationship between the dependent variable and the explanatory variable being log-transformed is not linear.
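The definitions of fitted values, residuals and the t-value above can be traced in a small numeric sketch. It is written in plain Python, uses made-up toy data (not from any real study), and tests the null hypothesis beta = 0 for the slope of a simple regression; all variable names are illustrative assumptions.

```python
import math

# Hypothetical toy data: x is the explanatory variable, y the dependent one.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS estimates for y = b0 + b1 * x
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Fitted value = estimate of the dependent variable for each unit
fitted = [b0 + b1 * xi for xi in x]

# Residual = actual value minus fitted value
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Standard error of the slope (n - 2 degrees of freedom in simple regression)
rss = sum(e ** 2 for e in residuals)
se_b1 = math.sqrt(rss / (n - 2) / sxx)

# t-value: t = (b - beta) / se(b), here with beta = 0 under H0
beta_h0 = 0.0
t_value = (b1 - beta_h0) / se_b1

print(b0, b1, se_b1, t_value)
```

With an intercept in the model the residuals sum to zero, which is a quick sanity check on any hand calculation.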

Assumptions

1. Linearity in parameters (linear relationship between Y and Xj)
- Remember: dummy variables are not checked for linearity
- Violation:
- Solution:
  - Run the OLS regression at this stage
  - Make a scatterplot of the standardized residuals against the standardized fitted values (to check for large standardized residuals). Run the code that creates the fitted values first
  - Transform the problematic explanatory variables with a logarithm
  - Check the scatterplot matrix again
1. The expected value of errors equals 0
- Remember: this means there are no forgotten variables with considerable influence on the dependent variable (in practical science, check previous studies).
- Violation: the test of a partial regression coefficient is not valid, because the estimated coefficient is biased (check the formula for the t-value in this test)
1. Homoskedasticity of errors
- Remember: linearity between the dependent and explanatory variables can be seen here too (not important)
- Violation: the standard errors in the t-value formula are biased.
- Solution: White’s robust standard errors.
1. Normal distribution of errors
- Remember: since we don’t have the population data, we work with residuals. So any conclusion about the errors is possible only through hypothesis testing
- Violation: in this case the test of partial regression coefficients isn’t valid
- Solution: it’s not problematic if the sample is big enough
1. Errors are independent
- Remember: Cov(Ei,Ej) = 0.
- Violation: the standard errors in the t-value formula are biased.
- Solution: multilevel regression, econometric methods or hierarchical regression (beyond the scope of these notes)
1. No perfect multicollinearity
- Violation: Parameters of the regression model cannot be estimated.
- Solution: Exclude one of the explanatory variables
1. The number of units (sample size) is bigger than the number of estimated parameters (n > k)
- Violation: parameters cannot be estimated

! If these assumptions are met, the Gauss-Markov theorem states that the OLS estimator is the Best Linear Unbiased Estimator (BLUE)
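White’s robust standard errors, listed above as the remedy for heteroskedasticity, can be sketched for the simplest case. The example below is only an illustration: it assumes a simple regression with one explanatory variable, uses the basic HC0 variant without small-sample corrections, and runs on made-up data. The function name `slope_standard_errors` is hypothetical.

```python
import math

def slope_standard_errors(x, y):
    """Return (classical_se, white_se) for the slope of y = b0 + b1 * x.

    classical_se assumes constant error variance (homoskedasticity);
    white_se is the HC0 heteroskedasticity-robust version.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d ** 2 for d in dx)

    # OLS estimates and residuals
    b1 = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

    # Classical: one common error variance estimate for all units
    sigma2 = sum(e ** 2 for e in residuals) / (n - 2)
    classical_se = math.sqrt(sigma2 / sxx)

    # White (HC0): each unit contributes its own squared residual
    meat = sum((d * e) ** 2 for d, e in zip(dx, residuals))
    white_se = math.sqrt(meat / sxx ** 2)
    return classical_se, white_se

# Hypothetical toy data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
classical_se, white_se = slope_standard_errors(x, y)
print(classical_se, white_se)
```

The coefficient estimate itself is unchanged; only the standard error in the denominator of the t-value is replaced.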

Ordinary Least Squares method (OLS)

Additional requirements

1. Dependent variable is numeric; explanatory variables are either numeric or dummy
1. Each explanatory variable varies, and it’s desirable that these variables possess the widest possible range of variability
1. Outliers (judged by standardized residuals) and high-impact units are excluded
- Remember: an outlier is a value of the dependent variable with a large residual. A high-impact unit changes the slope considerably.
- Violation: drop the violating units, then re-run the OLS regression if any observations were excluded. Units can be dropped either by ID or by a criterion, e.g. Cook’s distance greater than… .
1. No overly strong multicollinearity
- Violation: inflated standard errors can make the model’s coefficients impossible to interpret.
- Solution: drop the variable
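One common way to judge whether multicollinearity is "overly strong" is the variance inflation factor (VIF), which the notes do not name; it is brought in here as an assumption. A minimal sketch for the special case of exactly two explanatory variables, where VIF = 1 / (1 - r^2) and r is their Pearson correlation. The data and function names are made up for illustration.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two explanatory variables.

    Raises ZeroDivisionError under perfect multicollinearity (r = +/-1),
    which mirrors the fact that the parameters cannot be estimated then.
    """
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical explanatory variables
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
vif = vif_two_predictors(x1, x2)
print(vif)
```

Rules of thumb such as "VIF above 5 or 10 is a problem" are conventions rather than tests, so the cut-off should follow whatever the course or field prescribes.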

Analysis

1. Run the summary of the regression and get an output
1. Check the p-value.
- The corresponding test is called ‘test of partial regression coefficient’
- Interpret the results
1. Check the estimates in the middle.
- Complete the (estimated) regression function with its help.
- Interpret the function.
1. Pay attention to the coefficient of determination (R-squared) below.
If we have only one explanatory variable, it’s the simple R-squared. If two or more, it is the multiple R-squared. Its value varies between 0 and +1. It shows what percentage of the variability of the dependent variable is explained by the linear effect of the explanatory variables. Interpret the value.
1. Pay attention to the last line of the output with the F-Test.
- It is the test of significance of the regression model (‘rho^2’).

*rho^2 is the population coefficient of determination

! Regarding the last two points: they are consistent with each other. If the coefficient of determination is fairly high, the model explains much of the variability of the dependent variable; we then typically reject H0 in favour of H1: rho^2 > 0, which means the model is, so to speak, good and appropriate. If, on the other hand, the coefficient of determination is low, we typically cannot reject H0: rho^2 = 0; the model barely explains anything, so it isn’t significant. R^2 refers to the sample, whereas rho^2 refers to the population. Importantly, for rho^2 we cannot say what proportion of the dependent variable’s variability the model explains; we can only say that rho^2 is positive.
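The link between R^2 and the F-test can be made concrete with the standard formula F = (R^2 / p) / ((1 - R^2) / (n - p - 1)). In this sketch p denotes the number of explanatory variables without the intercept, a symbol chosen here so it does not clash with k above; the data are made up and the function names are illustrative.

```python
def r_squared(y, fitted):
    """Share of the variability of y explained by the fitted values."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total variability
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained part
    return 1.0 - rss / tss

def f_statistic(r2, n, p):
    """F-statistic for H0: rho^2 = 0, with p explanatory variables and n units."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

# Hypothetical toy data: actual values and fitted values from y = 2.2 + 0.6 * x
y = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(y, fitted)
f_value = f_statistic(r2, n=len(y), p=1)
print(r2, f_value)
```

The formula shows why the two points move together: for fixed n and p, a larger R^2 always gives a larger F. Whether F is large enough to reject H0 still depends on the sample size.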

If different models are to be compared

Post-analysis

1. Look at the multiple correlation coefficient (the square root of R-squared); it is ‘multiple’ because it involves all explanatory variables. We never speak of positive or negative correlation when interpreting the multiple correlation.
- It tells how strong the relationship is between all explanatory variables and the dependent one.
1. Check the standardized partial regression coefficients (lm.beta()). We need to standardize them because the explanatory variables are measured in different units (e.g. one is a proportion, another a percentage).
- The higher the absolute value, the bigger is the impact of the variable
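Both post-analysis quantities can be computed by hand. The sketch below uses the usual definition of a standardized coefficient, b multiplied by sd(x) / sd(y); it is an assumption that lm.beta() follows the same definition. Data and function names are made up for illustration.

```python
import math
import statistics

def standardized_coefficient(b, x, y):
    """Rescale slope b so it is expressed in standard deviations of x and y."""
    return b * statistics.stdev(x) / statistics.stdev(y)

def multiple_correlation(r2):
    """Multiple correlation coefficient: square root of R-squared (never negative)."""
    return math.sqrt(r2)

# Hypothetical toy data with OLS slope 0.6 and R-squared 0.6
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

beta_std = standardized_coefficient(0.6, x, y)
multiple_r = multiple_correlation(0.6)
print(beta_std, multiple_r)
```

With a single explanatory variable the two numbers coincide in absolute value; with several variables each one gets its own standardized coefficient, and these are compared by absolute value.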

2nd regression analysis

Run full OLS regression and summary of it.

Part of comparison of both models and conclusion

Conduct the ANOVA test
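A sketch of what the ANOVA comparison computes, assuming the first model is nested in the second (the full model contains all of its explanatory variables plus some more). The F-statistic compares the residual sums of squares (RSS) of the two models. Adjusted R-squared is included because it serves the same comparison purpose. All names and numbers are illustrative.

```python
def nested_f_statistic(rss_reduced, rss_full, n, p_reduced, p_full):
    """F-statistic for H0: the extra variables of the full model add nothing.

    p_reduced and p_full count explanatory variables without the intercept.
    """
    q = p_full - p_reduced    # number of added variables
    df_full = n - p_full - 1  # residual degrees of freedom of the full model
    return ((rss_reduced - rss_full) / q) / (rss_full / df_full)

def adjusted_r_squared(r2, n, p):
    """R-squared penalized for the number of explanatory variables p."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical numbers: intercept-only model vs. model with one variable
f_value = nested_f_statistic(rss_reduced=6.0, rss_full=2.4,
                             n=5, p_reduced=0, p_full=1)
adj_r2 = adjusted_r_squared(r2=0.6, n=5, p=1)
print(f_value, adj_r2)
```

A significant F means the full model fits the data better than the reduced one; the p-value itself requires the F distribution, which is left to the statistical software.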

When making conclusions:

Don’t forget to mention ‘on average’ and ‘assuming all other variables remain unchanged’ (ceteris paribus condition)
If the explanatory variable is measured in %, describe the change as an increase of 1 percentage point (p.p.)
When the explanatory variable is dichotomous, don’t forget to mention ‘in comparison to {another category}’
Additionally, when the explanatory variable is dichotomous, the ceteris paribus part is as follows: given the value(s) of {all other explanatory variables}
lin-log model: assuming all other variables remain unchanged, if {explanatory variable} increases by 1%, the {dependent variable} increases on average by b/100 units
coefficient of determination R^2: if the explanatory variables need to be named, mention them in the form in which they enter the model, e.g. ‘natural logarithm of {explanatory variable}’
Interactions:
- num.exp.var. alone: assuming all other variables remain unchanged, if the {num.exp.var.} for {ref. cat. of dummy var.} increases by 1, the {dep.var.} on average increases by ‘b’
- Interaction of num.exp.var. and category of dummy var.: assuming all other variables remain unchanged, if the num.exp.var. for {category of dummy var.} increases by 1, the {dep.var.} on average increases by ‘b’ MORE than for {ref.cat}
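The lin-log and interaction wordings above rest on simple arithmetic, sketched here with made-up coefficients. For the lin-log model the b/100 rule is an approximation of b * ln(1.01); for the interaction, the slope in the non-reference category is the sum of the two coefficients. Function names are illustrative.

```python
import math

def lin_log_effect(b, pct_increase=1.0):
    """Change in y when x rises by pct_increase percent in y = b0 + b * ln(x).

    Returns (exact, approx): exact uses the logarithm,
    approx is the b/100 rule of thumb used in the interpretation.
    """
    exact = b * math.log(1.0 + pct_increase / 100.0)
    approx = b * pct_increase / 100.0
    return exact, approx

def slopes_with_interaction(b_numeric, b_interaction):
    """Slope of the numeric variable in each category of the dummy variable."""
    return {
        "reference": b_numeric,              # coefficient on the variable's own line
        "other": b_numeric + b_interaction,  # 'b' MORE than the reference category
    }

# Hypothetical coefficients
exact, approx = lin_log_effect(b=50.0)
slopes = slopes_with_interaction(b_numeric=2.0, b_interaction=0.5)
print(exact, approx, slopes)
```

For a 1% change the exact and approximate effects are very close, which is why the b/100 wording is acceptable; for large percentage changes the gap grows and the exact form is safer.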

REGRESSION ANALYSIS

OLEH BURDUKOV

2025-04-01