Author: Muhammad Minhaj Akhtar
Designation: Lecturer Economics
College: Government Graduate College Jauharabad
Regression analysis is one of the most commonly used tools in econometrics. But what is regression analysis? The dictionary meaning of regression is “backward movement” or “return to an earlier stage of development”. In fact, regression analysis as it is currently used has nothing to do with regression as dictionaries define the term. The term regression was coined by Francis Galton of England, who was studying the relationship between the height of children and the height of their parents. He found that although tall parents had tall children and short parents had short children, there was a tendency for children's height to move toward the average. Galton termed this a “regression toward mediocrity”.
Regression analysis is concerned with the study of the dependence of one variable, the dependent variable, on one or more other variables, the explanatory variables. The purpose of regression analysis is to estimate and predict the population mean value of the dependent variable on the basis of known or fixed values of the independent variables. In other words, it finds the expected value of the dependent variable conditional on the given values of the independent variables, that is, E(Y | X) (D.N. Gujarati).
A regression equation is a mathematical equation that is used to predict the values of one dependent variable from known or given values of independent variable(s).
The four main objectives of regression analysis and the questions they address are summarized in Table 1.
| Regression Objective | Generic Types of Questions Addressed |
|---|---|
| Estimating causal effects | How does a certain factor affect the outcome? |
| Determining how well certain factors predict the outcome | Does a certain factor predict the outcome? What factors predict the outcome? |
| Forecasting an outcome | What is the best prediction/forecast of the outcome? |
| Adjusting outcomes for various (non-performance) factors | How well did a subject perform relative to what we would expect given certain factors? |
The linear regression model shows the linear dependence of one variable on one or more independent variables. A simple linear regression model describes the linear dependence of one variable on only one independent variable; it is also called the bivariate or two-variable regression model. An example is the dependence of consumption on disposable income. A multiple linear regression model describes the linear dependence of one variable on two or more independent variables; it is also called the multivariate regression model. For example, crop yield depends on rainfall, temperature, sunshine, fertilizer, etc.
A simple linear regression model can be written as:
\[Y_i=β_0+β_1 X_i+u_i\]
It has five components:
1. The dependent variable is the variable that you are trying to explain or predict. Its value depends on the values of other variables. It is also called the outcome variable, response variable, Y-variable, or explained variable. For example, in our crop yield regression, crop yield is the dependent variable.
2. The independent variable is a variable that explains the variation in the dependent variable. It is also known as the explanatory variable, regressor, treatment variable, or X-variable. For example, in the regression of consumption on income, income is the explanatory variable.
3. The coefficient on the explanatory variable (β1) is the slope of the regression line. It measures the change in the Y-variable associated with a one-unit change in the X-variable. For example, β1 = 0.6 means that a one-unit increase in income is associated with a 0.6-unit increase in consumption.
4. The intercept term (β0) is the Y-intercept of the regression line, that is, the expected value of Y when X = 0. It is sometimes called the “constant” term.
5. The error term (u_i) indicates how far an individual data point lies, vertically, from the true regression line. It arises because a regression typically cannot predict the outcome perfectly.
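The roles of the slope, the intercept, and the error term can be illustrated with a minimal R sketch. The parameter values (β0 = 17, β1 = 0.6), the household figures, and the function name `expected_consumption` are assumptions chosen only for illustration.

```r
# Hypothetical parameter values, assumed only for illustration
beta0 <- 17    # intercept: expected consumption when income is 0
beta1 <- 0.6   # slope: change in expected consumption per unit of income

# Systematic part of the model: beta0 + beta1 * X
expected_consumption <- function(income) beta0 + beta1 * income

# Effect of a one-unit increase in income (equals beta1)
slope_effect <- expected_consumption(101) - expected_consumption(100)

# Error term for a hypothetical household with income 100 and consumption 70:
# the vertical distance between the observed value and the regression line
u_i <- 70 - expected_consumption(100)

print(slope_effect)   # 0.6
print(u_i)            # -7
```

A negative error term, as here, means the household consumes less than the regression line predicts for its income.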
Regression is an important tool for econometricians who want to estimate the relationship between economic variables. For example:
1. An economist may be interested in studying the dependence of personal consumption expenditure on after-tax (disposable) real personal income. The slope of this relationship gives the marginal propensity to consume (MPC).
2. A monopolist who can fix the price or the output (but not both) may want to find out how demand responds to a change in price. This gives the price elasticity of demand.
3. A labor economist may want to study the relationship between the rate of change of money wages and the unemployment rate.
4. A monetary economist may want to examine the amount of money, as a proportion of their income, that people would want to hold at various rates of inflation (the money demand function).
5. A marketing director may be interested in estimating the elasticity of demand with respect to advertising expenditure.
6. Finally, an agronomist may be interested in studying the dependence of a particular crop yield, say, of wheat, on temperature, rainfall, amount of sunshine, and fertilizer.
Deterministic relationships between variables are exact, nonrandom relationships of the kind found in mathematical models. If we plot such a relationship, all data points lie exactly on the line or curve.
\[Y=β_0+β_1 X\]
Stochastic relationships between variables are inexact or random. If we plot such a relationship, some data points lie above the line, some on it, and some below it. Econometric models are examples of stochastic relationships.
\[Y=β_0+β_1 X+ε\]
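The contrast between the two kinds of relationship can be sketched in R with simulated data. The parameter values, the spread of the random term, and the seed are assumptions made for illustration only.

```r
# Assumed values for illustration; not estimates from any data
set.seed(1)                          # make the random draws reproducible
X     <- seq(80, 260, by = 20)       # ten values of the explanatory variable
beta0 <- 17
beta1 <- 0.6

Y_det <- beta0 + beta1 * X           # deterministic: no random component
eps   <- rnorm(length(X), mean = 0, sd = 8)
Y_sto <- beta0 + beta1 * X + eps     # stochastic: line plus random error

# Vertical distance of each point from the line beta0 + beta1 * X
dist_det <- Y_det - (beta0 + beta1 * X)
dist_sto <- Y_sto - (beta0 + beta1 * X)

all(dist_det == 0)    # TRUE: every deterministic point is on the line
any(dist_sto != 0)    # TRUE: stochastic points scatter around the line
```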
| Aspect | Conditional Mean | Unconditional Mean |
|---|---|---|
| Definition | The expected value of Y given fixed values of X. | The expected value of Y without reference to the values of X. |
| Consideration of X | The X variable is taken into account | The X variable is disregarded |
| Notation | E(Y \| X) | E(Y) |
| Example | What is the average weekly consumption of a family with a particular income level? | What is the average weekly consumption of a family? |
Regression analysis is concerned with the dependence of one variable on other variable(s), but it does not necessarily imply causation (a cause-and-effect relationship). In the words of Kendall and Stuart, “A statistical relationship, however strong and however suggestive, can never establish causal connection: our ideas of causation must come from outside statistics, ultimately from some theory or other”. For example, if we regress crop yield on rainfall, nothing in the statistics itself rules out the reverse relationship, in which crop yield affects rainfall; it is common sense, not statistics, that tells us the reverse cannot happen.
| Regression Analysis | Correlation Analysis |
|---|---|
| The purpose of regression analysis is to estimate or predict the average value of the dependent variable on the given values of independent variables. | The purpose of correlation analysis is to measure the strength of linear association between two variables. |
| There is an asymmetry i.e. there is a distinction between the dependent and independent variables. | We treat any (two) variables symmetrically i.e. no distinction between the dependent and independent variables. |
| Y is random, X has given or fixed values | Both variables are assumed to be random |
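The symmetry of correlation and the asymmetry of regression summarized in the table above can be checked with a short R sketch. It uses the ten income levels and one consumption value per income level taken from the household data in this document; the variable names are illustrative.

```r
# Ten income levels and one household's consumption at each level
income <- c(80, 100, 120, 140, 160, 180, 200, 220, 240, 260)
cons   <- c(55, 65, 79, 80, 102, 110, 120, 135, 137, 150)

# Correlation treats the two variables symmetrically
r_xy <- cor(income, cons)
r_yx <- cor(cons, income)

# Regression does not: the slope depends on which variable is dependent
slope_y_on_x <- coef(lm(cons ~ income))[["income"]]
slope_x_on_y <- coef(lm(income ~ cons))[["cons"]]

isTRUE(all.equal(r_xy, r_yx))     # TRUE: same correlation either way
slope_y_on_x == slope_x_on_y      # FALSE: the two regressions differ
```

In the two-variable case the product of the two slopes equals the squared correlation coefficient, which is one way to see how the two analyses are related.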
| Dependent Variable | Explanatory Variable |
|---|---|
| Dependent variable | Explanatory variable |
| Explained variable | Independent variable |
| Predictand | Predictor |
| Regressand | Regressor |
| Response | Stimulus |
| Endogenous | Exogenous |
| Outcome | Covariate |
| Controlled variable | Control variable |
Population refers to the entire group of individuals or entities about which an inference is to be made. For example, if a researcher wants to study the average consumption and income of all households in a city (suppose there are 60), the population consists of every household in that city.
Sample is a subset of the population that is selected to represent the entire population. For example, rather than collecting consumption data on all 60 households, a researcher selects 20 households at random to make an inference about the average consumption and income of all 60 households.
Remember that the purpose of regression analysis is to estimate or predict the population mean value of Y on the basis of known or fixed values of X, that is, to find the average consumption of the 60 households given their incomes. To understand this, we take the data on the consumption and income of the 60 households given in the table below. These 60 households are divided into 10 income groups.
# Load necessary libraries
library(knitr)
library(kableExtra)  # provides kable_styling(), row_spec(), column_spec() and the %>% pipe
# Create a data frame with the given data
# (NA means the income group has no sixth or seventh family)
data <- data.frame(
  Income = c(80, 100, 120, 140, 160, 180, 200, 220, 240, 260),
  Cons_Fam_1 = c(55, 65, 79, 80, 102, 110, 120, 135, 137, 150),
  Cons_Fam_2 = c(60, 70, 84, 93, 107, 115, 136, 137, 145, 152),
  Cons_Fam_3 = c(65, 74, 90, 95, 110, 120, 140, 140, 155, 175),
  Cons_Fam_4 = c(70, 80, 94, 103, 116, 130, 144, 152, 165, 178),
  Cons_Fam_5 = c(75, 85, 98, 108, 118, 135, 145, 157, 175, 180),
  Cons_Fam_6 = c(NA, 88, NA, 113, 125, 140, NA, 160, 189, 185),
  Cons_Fam_7 = c(NA, NA, NA, 115, NA, NA, NA, 162, NA, 191),
  Total = c(325, 462, 445, 707, 678, 750, 685, 1043, 966, 1211),
  Conditional_Mean = c(65, 77, 89, 101, 113, 125, 137, 149, 161, 173)
)
# Print the table with formatting
kable(data, caption = "Weekly Family Income and Consumption Expenditure") %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(0, bold = TRUE) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(10, bold = TRUE)
| Income | Cons_Fam_1 | Cons_Fam_2 | Cons_Fam_3 | Cons_Fam_4 | Cons_Fam_5 | Cons_Fam_6 | Cons_Fam_7 | Total | Conditional_Mean |
|---|---|---|---|---|---|---|---|---|---|
| 80 | 55 | 60 | 65 | 70 | 75 | NA | NA | 325 | 65 |
| 100 | 65 | 70 | 74 | 80 | 85 | 88 | NA | 462 | 77 |
| 120 | 79 | 84 | 90 | 94 | 98 | NA | NA | 445 | 89 |
| 140 | 80 | 93 | 95 | 103 | 108 | 113 | 115 | 707 | 101 |
| 160 | 102 | 107 | 110 | 116 | 118 | 125 | NA | 678 | 113 |
| 180 | 110 | 115 | 120 | 130 | 135 | 140 | NA | 750 | 125 |
| 200 | 120 | 136 | 140 | 144 | 145 | NA | NA | 685 | 137 |
| 220 | 135 | 137 | 140 | 152 | 157 | 160 | 162 | 1043 | 149 |
| 240 | 137 | 145 | 155 | 165 | 175 | 189 | NA | 966 | 161 |
| 260 | 150 | 152 | 175 | 178 | 180 | 185 | 191 | 1211 | 173 |
The data in the table represent the whole population, where Y is weekly consumption expenditure and X is weekly income. Economic theory suggests that consumption expenditure increases as income increases. From the table we see that there is considerable variation in weekly consumption expenditure within each income group. For example, the households with a weekly income of $80 (there are five in the table above) have weekly consumption ranging from $55 to $75; similarly, the households with a weekly income of $100 (there are six) have weekly consumption ranging from $65 to $88. There is also considerable variation in consumption across the groups, but note that, on average, weekly consumption expenditure increases as income increases (see the last column). In other words, households with higher income have higher consumption on average. For example, the average weekly consumption of households whose income is $80 is $65, and the average weekly consumption of households whose income is $160 is $113.
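The conditional means in the last column, and the unconditional mean of consumption over all 60 households, can be recomputed with a short R sketch. The consumption values are grouped by income level so that each group reproduces the Total column of the table; the object names are illustrative.

```r
# Weekly consumption of the 60 households, grouped by weekly income
consumption <- list(
  "80"  = c(55, 60, 65, 70, 75),
  "100" = c(65, 70, 74, 80, 85, 88),
  "120" = c(79, 84, 90, 94, 98),
  "140" = c(80, 93, 95, 103, 108, 113, 115),
  "160" = c(102, 107, 110, 116, 118, 125),
  "180" = c(110, 115, 120, 130, 135, 140),
  "200" = c(120, 136, 140, 144, 145),
  "220" = c(135, 137, 140, 152, 157, 160, 162),
  "240" = c(137, 145, 155, 165, 175, 189),
  "260" = c(150, 152, 175, 178, 180, 185, 191)
)

# Conditional means E(Y | X): one average per income group
cond_mean <- sapply(consumption, mean)

# Unconditional mean E(Y): average over all households, ignoring income
uncond_mean <- mean(unlist(consumption))

print(cond_mean)     # 65 77 89 101 113 125 137 149 161 173
print(uncond_mean)   # 121.2
```

The conditional mean changes with the income level, while the unconditional mean is a single number for the whole population.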
Figure 1 shows the expected values of weekly consumption Y at various levels of income. The circles show the mean value of Y (consumption) at each value of X (income). Remember that at each income level, say $80, consumption can take any value within its probability distribution, as shown by the dark lines in Figure 1. That is why the dependent variable is random. In other words, households with an income of $80 do not all have a consumption expenditure of $65; a given household's consumption can be above or below $65, but the group average is $65. This is the purpose of regression analysis: to predict or estimate the average value of Y given various values of X.
Population Regression Function
The population regression function (PRF) shows the functional relationship between the conditional expected value of the dependent variable, E(Y|X_i), and the known or fixed values of the independent variable X_i. It is also called the conditional expectation function.
\[E(Y|X_i)=f(X_i)\]
Assuming that the consumption function is linear, we can write our population regression function as:
\[E(Y|X_i)=β_0+β_1 X_i\]
where β0 and β1 are unknown but fixed parameters known as the regression coefficients; β0 and β1 are also known as the intercept and slope coefficients, respectively.
Population Regression Line
The population regression line, or population regression curve, is the locus of the conditional means of the dependent variable Y for the fixed or given values of the independent variable. More simply, it is the curve connecting the means of the subpopulations of Y corresponding to the given values of the regressor X.
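For the household data in this document, the ten conditional means happen to lie exactly on a straight line, so the population regression line can be recovered by fitting a line through them. The sketch below uses `lm()` purely as a line-fitting device; the object names are illustrative.

```r
# Income levels and the conditional mean consumption at each level
income    <- c(80, 100, 120, 140, 160, 180, 200, 220, 240, 260)
cond_mean <- c(65, 77, 89, 101, 113, 125, 137, 149, 161, 173)

# Fit a straight line through the conditional means
prl <- lm(cond_mean ~ income)

print(coef(prl))                        # intercept 17, slope 0.6
largest_gap <- max(abs(residuals(prl))) # zero up to rounding error
```

In this population, therefore, E(Y|X_i) = 17 + 0.6 X_i: each additional dollar of weekly income raises average weekly consumption by 60 cents.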
Remember that an econometric model is a set of behavioral equations that represent relationships between economic variables. It consists of some observed variables and some unobserved variables. The observed variables are those included in the model, often called independent or explanatory variables, which explain the variation in Y, the dependent variable. For example, in our consumption function, consumption is the dependent variable whose variation we try to explain, and income is the independent variable that explains the variation in consumption. Income, however, is not the only factor that affects consumption. It can be seen from Figure 1 that, at each income level, the consumption of individual families is clustered around the mean consumption.
This deviation can be measured as the vertical distance between the population regression line (the solid straight line), which shows the conditional mean consumption E(Y|X_i), and the consumption of an individual family, Y_i. Thus,
\[u_i=Y_i-E(Y|X_i)\]
\[Y_i=E(Y|X_i)+u_i\]
Substituting E(Y|X_i)=β_0+β_1 X_i into the last equation gives
\[Y_i=β_0+β_1 X_i+u_i\]
Here u_i is the deviation of each family's consumption from the conditional mean consumption. It can be positive or negative. Technically it is called the stochastic error term or disturbance term. This equation shows that the consumption expenditure of each family is linearly related to its income plus a disturbance term. It means that there are family-specific factors that affect the consumption of each family. Take family size: a large family can have a higher consumption level even though its income is lower than that of a small family. Even if we include family size in the model, some variation remains that is not explained by income and family size. There are many factors that we cannot include in the model, either because our knowledge of them is limited or because we cannot measure them. Even if we included all possible factors in the model, there might still be measurement errors in the observed variables themselves. Income is an example, because few people report their personal income exactly. Moreover, the purpose of regression analysis is to estimate the average consumption behavior of all families, not the consumption of each individual family. Thus, we include a random error term in our population regression function as a proxy for all the omitted or neglected factors, other than X, that affect Y.
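The disturbance term can be computed directly for any income group in the population. A minimal R sketch for the $80 and $100 groups follows; the consumption values come from the household table in this document, and the object names are illustrative.

```r
# Weekly consumption of the families with weekly income 80 and 100
y_80  <- c(55, 60, 65, 70, 75)
y_100 <- c(65, 70, 74, 80, 85, 88)

# u_i = Y_i - E(Y | X_i): deviation from the group's conditional mean
u_80  <- y_80  - mean(y_80)
u_100 <- y_100 - mean(y_100)

print(u_80)    # -10  -5   0   5  10
print(u_100)   # -12  -7  -3   3   8  11

# Within each income group the disturbances average to zero
print(sum(u_80))    # 0
print(sum(u_100))   # 0
```

Some families lie above their group mean and some below, but the deviations cancel within each group, which is why the conditional mean summarizes the group's typical consumption.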
Studying the whole population is difficult because it consumes time, energy, and resources. That is why, instead of studying the whole population, we take a sample that is representative of it. Thus, in regression our task is to estimate the population regression function (PRF) on the basis of the sample regression function (SRF). In fact, we can draw any number N of random samples, and no two random samples are likely to be the same. In the PRF our purpose is to find the average weekly consumption on the basis of given or fixed values of income. In the SRF our task is to predict or estimate the average weekly consumption Y in the population as a whole corresponding to the chosen X.
Remember that we cannot estimate the PRF exactly from an SRF because of sampling fluctuations. Because we can draw N samples from a given population, there will be N SRFs, and each SRF will provide a different estimate of the population parameters. Suppose that we take two random samples from the population of 60 families and draw a sample regression line (SRL) for each, SRL1 and SRL2. Which SRL is the true representative of the population regression line (PRL)? There is no way we can be absolutely sure that either of the regression lines shown in Figure 3 represents the true population regression line.
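Sampling fluctuation can be sketched in R by drawing two random samples of 20 households from the 60-household population and fitting a line to each. The seed, the sample size of 20, and the object names are assumptions for illustration; `lm()` is used here only as a convenient way to fit a line to each sample.

```r
# Build the population of 60 households in long format
income_levels <- c(80, 100, 120, 140, 160, 180, 200, 220, 240, 260)
group_sizes   <- c(5, 6, 5, 7, 6, 6, 5, 7, 6, 7)
population <- data.frame(
  income = rep(income_levels, times = group_sizes),
  consumption = c(
    55, 60, 65, 70, 75,
    65, 70, 74, 80, 85, 88,
    79, 84, 90, 94, 98,
    80, 93, 95, 103, 108, 113, 115,
    102, 107, 110, 116, 118, 125,
    110, 115, 120, 130, 135, 140,
    120, 136, 140, 144, 145,
    135, 137, 140, 152, 157, 160, 162,
    137, 145, 155, 165, 175, 189,
    150, 152, 175, 178, 180, 185, 191
  )
)

set.seed(2024)  # arbitrary seed so the two samples are reproducible
sample1 <- population[sample(nrow(population), 20), ]
sample2 <- population[sample(nrow(population), 20), ]

# One sample regression function per sample, plus the population line
srf1 <- coef(lm(consumption ~ income, data = sample1))
srf2 <- coef(lm(consumption ~ income, data = sample2))
prf  <- coef(lm(consumption ~ income, data = population))

print(rbind(SRF1 = srf1, SRF2 = srf2, PRF = prf))
```

The two sample lines generally differ from each other and from the population line, and in practice only one sample, and therefore only one SRF, is available.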
The sample regression function can be written as
\[\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\]
where \(\hat{Y}_i\) (read “Y-hat” or “Y-cap”) is the estimator of E(Y | X_i), \(\hat{\beta}_0\) is the estimator of β0, and \(\hat{\beta}_1\) is the estimator of β1. We can also write the SRF in another form:
\[Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i\]
where \(\hat{u}_i\), the residual, is considered an estimate of u_i.
Estimator and Estimate. An estimator is a rule, formula, or method that tells us how to estimate a population parameter from the information provided by the sample. A particular numerical value obtained from the estimator is called an estimate. An estimator is random, whereas an estimate is nonrandom.
Looking forward… To sum up, our primary objective in regression analysis is to estimate the PRF
\[Y_i=β_0+β_1 X_i+u_i\]
on the basis of the SRF
\[Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i\]
We know that we can have as many SRFs as there are random samples, so the question is which SRF best approximates the PRF. Can we devise a rule or a method that will make this approximation as “close” as possible? In other words, how should the SRF be constructed so that \(\hat{\beta}_0\) is as “close” as possible to the true β0 and \(\hat{\beta}_1\) is as “close” as possible to the true β1, even though we will never know the true β0 and β1? The answer is yes: the method is called ordinary least squares (OLS), which chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the sum of squared residuals.
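The minimizing property of least squares can be checked numerically. The sketch below treats one household per income level (taken from the household data in this document) as a sample, fits the line with R's `lm()`, and compares its sum of squared residuals with that of nearby lines and of the population line 17 + 0.6 X. The helper `ssr()` and the size of the perturbations are assumptions for illustration.

```r
# A sample of ten households, one per income level
income <- c(80, 100, 120, 140, 160, 180, 200, 220, 240, 260)
cons   <- c(55, 65, 79, 80, 102, 110, 120, 135, 137, 150)

# Sum of squared residuals for a candidate line b0 + b1 * X
ssr <- function(b0, b1) sum((cons - b0 - b1 * income)^2)

# Least squares estimates from lm()
fit    <- lm(cons ~ income)
b0_hat <- coef(fit)[["(Intercept)"]]
b1_hat <- coef(fit)[["income"]]

ssr_ols <- ssr(b0_hat, b1_hat)

# Any other line gives a larger sum of squared residuals in this sample
ssr_ols < ssr(b0_hat + 1, b1_hat)      # TRUE: shifted intercept
ssr_ols < ssr(b0_hat, b1_hat + 0.01)   # TRUE: tilted slope
ssr_ols < ssr(17, 0.6)                 # TRUE: even the population line
```

The last comparison shows that least squares finds the line that fits the sample best, which need not coincide with the population regression line.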