Introduction to Regression Analysis

Author: Muhammad Minhaj Akhtar
Designation: Lecturer Economics
College: Government Graduate College Jauharabad

Origin of Regression Analysis

Regression analysis is one of the most commonly used tools in econometrics. But what is regression analysis? The dictionary meaning of regression is “backward movement” or “return to an earlier stage of development”. In fact, regression analysis as it is currently used has little to do with regression as dictionaries define the term. The term was coined by Francis Galton of England, who was studying the relationship between the heights of children and the heights of their parents. He found that although tall parents tended to have tall children and short parents tended to have short children, children's heights tended to move toward the population average. Galton termed this a “regression toward mediocrity”.

Regression analysis definition

Regression analysis is concerned with the study of the dependence of one variable, the dependent variable, on one or more other variables, the explanatory variables. Its purpose is to estimate or predict the population mean value of the dependent variable on the basis of known or fixed values of the explanatory variables. In other words, it finds the expected value of the dependent variable Y conditional on the given values of the explanatory variables, that is, E(Y | X). (D.N. Gujarati)

Regression Equation

A regression equation is a mathematical equation that is used to predict the values of one dependent variable from known or given values of independent variable(s).

Four main objectives for regression analysis

To quantify how one factor causally affects another. For example, if a person obtains one additional year of schooling, how much would we expect that person’s income to increase?
To forecast or predict an outcome. For example, policymakers may want to forecast next year's inflation rate. They include in the regression model various factors that help forecast inflation.
To determine the predictors of some factor. For example, parents or school counselors may want to know what factors could predict whether a teenager is using drugs.
To adjust an outcome for various factors. For example, rather than just evaluating a teacher’s effectiveness based on his/her students’ test scores, we could adjust those scores based on the students’ prior scores and perhaps the demographics and English-language status of the students.

The four main objectives of regression analysis and the questions they address are summarized in Table 1.

Regression Objective	Generic Types of Questions Addressed
Estimating causal effects	How does a certain factor affect the outcome?
Determining how well certain factors predict the outcome	Does a certain factor predict the outcome? What factors predict the outcome?
Forecasting an outcome	What is the best prediction/forecast of the outcome?
Adjusting outcomes for various (non-performance) factors	How well did a subject perform relative to what we would expect given certain factors?

Linear Regression Model

The linear regression model expresses the linear dependence of one variable on one or more independent variables. A simple linear regression model relates the dependent variable to only one independent variable; it is also called the bivariate or two-variable regression model. An example is the dependence of consumption on disposable income. A multiple linear regression model relates the dependent variable to two or more independent variables; it is sometimes loosely called a multivariate regression model. For example, crop yield depends on rainfall, temperature, sunshine, fertilizer, etc.

Components of Linear Regression Model

A simple linear regression model can be written as:
\[Y_i=β_0+β_1 X_i+u_i\]

It has five components:
1. The dependent variable is the variable that you are trying to explain or predict. Its value depends on the values of other variables. It is also called the outcome variable, response variable, Y-variable, or explained variable. For example, in the crop yield regression, crop yield is the dependent variable.
2. The independent variable is a variable that explains the variation in the dependent variable. It is also known as the explanatory variable, regressor, treatment variable, or X-variable. For example, in the regression of consumption on income, income is the explanatory variable.
3. The coefficient on the explanatory variable (β1) indicates the slope of the regression line. It measures the change in the Y-variable associated with a one-unit change in the X-variable. For example, if β1 = 0.6, a one-unit increase in income is associated with a 0.6-unit increase in consumption.
4. The intercept term (β0) indicates the Y-intercept of the regression line, or the expected value of Y when X = 0. This is sometimes called the “constant” term.
5. The error term (ui) indicates how far an individual data point lies, vertically, from the true regression line. It arises because regressions typically cannot predict the outcome perfectly.
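
The five components can be seen together in a small simulation. The following is a minimal sketch in R, assuming made-up parameter values (β0 = 17, β1 = 0.6), an assumed income range, and normally distributed errors; none of these come from real data. It generates Y from the model and then estimates the intercept and slope with `lm()`.

```r
# Simulated illustration: all numbers below are assumptions, not real data.
set.seed(123)
beta0 <- 17    # intercept (assumed)
beta1 <- 0.6   # slope coefficient (assumed)
n <- 200       # number of simulated households (assumed)

income <- runif(n, min = 80, max = 260)      # independent variable X
u <- rnorm(n, mean = 0, sd = 10)             # error term u_i
consumption <- beta0 + beta1 * income + u    # dependent variable Y

# Estimate the intercept and slope from the simulated data
fit <- lm(consumption ~ income)
estimates <- coef(fit)
print(estimates)
```

Because of the error term, the estimates will generally be close to, but not exactly equal to, the assumed values of β0 and β1.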

The Use of Regression Analysis

Regression is an important tool for econometricians to estimate relationships between economic variables. For example:
1. An economist may be interested in studying the dependence of personal consumption expenditure on after-tax (disposable) real personal income. Such an analysis gives an estimate of the marginal propensity to consume (MPC).
2. A monopolist who can fix the price or the output (but not both) may want to find out the response of demand to a change in price. Such an analysis gives an estimate of the price elasticity of demand.
3. A labor economist may want to study the relationship between the rate of change of money wages and the unemployment rate.
4. A monetary economist may want to examine the relationship between the amount of money, as a proportion of their income, that people want to hold and the rate of inflation (the money demand function).
5. A marketing director may be interested in estimating the elasticity of demand with respect to advertising expenditure.
6. Finally, an agronomist may be interested in studying the dependence of a particular crop yield, say, of wheat, on temperature, rainfall, amount of sunshine, and fertilizer.

Deterministic and Stochastic Relationships

Deterministic relationships between variables are exact, nonrandom relationships of the kind found in mathematical models. If we plot such a relationship, all data points lie exactly on the line or curve.
\[Y=β_0+β_1 X\]

Stochastic relationships between variables are inexact, random relationships. If we plot such a relationship, some data points lie above the line, some on it, and some below it. Econometric models are examples of stochastic relationships.
\[Y=β_0+β_1 X+ε\]
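
The difference between the two kinds of relationship can be checked numerically. The following is a minimal sketch in R with assumed values for β0, β1, and the spread of ε (all hypothetical). It measures how far each Y lies from the line β0 + β1 X in both cases.

```r
# Hypothetical values chosen only for illustration
set.seed(1)
beta0 <- 17
beta1 <- 0.6
x <- seq(80, 260, by = 20)

# Deterministic relationship: Y is an exact function of X
y_det <- beta0 + beta1 * x

# Stochastic relationship: the same line plus a random term epsilon
eps <- rnorm(length(x), mean = 0, sd = 5)
y_sto <- beta0 + beta1 * x + eps

# Vertical distance of every point from the line
dev_det <- y_det - (beta0 + beta1 * x)
dev_sto <- y_sto - (beta0 + beta1 * x)

print(all(dev_det == 0))   # every deterministic point is on the line
print(round(dev_sto, 2))   # stochastic points lie above or below the line
```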

Aspect	Conditional Mean	Unconditional Mean
Definition	The expected value of Y given fixed values of X.	The expected value of Y without conditioning on X.
Consideration of X	The X variable is taken into account	The X variable is disregarded
Notation	E(Y | X)	E(Y)
Example	What is the average weekly consumption of families at a particular income level?	What is the average weekly consumption of all families?

Regression vs Causation

Regression analysis is concerned with the dependence of one variable on one or more other variables, but it does not necessarily imply causation (a cause-and-effect relationship). In the words of Kendall and Stuart, “A statistical relationship, however strong, can never establish causal connection: causation must come from outside statistics, ultimately from some theory or other”. For example, when we regress crop yield on rainfall, statistics alone gives no reason to rule out that crop yield affects rainfall. It is common sense, not the regression, that tells us the reverse cannot happen.

Regression Analysis	Correlation Analysis
The purpose of regression analysis is to estimate or predict the average value of the dependent variable on the given values of independent variables.	The purpose of correlation analysis is to measure the strength of linear association between two variables.
There is an asymmetry i.e. there is a distinction between the dependent and independent variables.	We treat any (two) variables symmetrically i.e. no distinction between the dependent and independent variables.
Y is random, X has given or fixed values	Both variables are assumed to be random
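
The symmetry of correlation and the asymmetry of regression can be illustrated with simulated data. The following is a minimal sketch in R; the variables `x` and `y` and the numbers that generate them are assumptions made only for illustration.

```r
# Simulated data (assumed values, for illustration only)
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)
y <- 5 + 2 * x + rnorm(100, mean = 0, sd = 15)

# Correlation treats the two variables symmetrically
r_xy <- cor(x, y)
r_yx <- cor(y, x)

# Regression does not: the slope depends on which variable is dependent
slope_y_on_x <- coef(lm(y ~ x))[["x"]]
slope_x_on_y <- coef(lm(x ~ y))[["y"]]

print(c(r_xy = r_xy, r_yx = r_yx))
print(c(y_on_x = slope_y_on_x, x_on_y = slope_x_on_y))
```

Swapping the roles of the variables leaves the correlation coefficient unchanged but gives a different regression slope.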

Terminology and Notation

Dependent Variable	Explanatory Variable
Dependent variable	Explanatory variable
Explained variable	Independent variable
Predictand	Predictor
Regressand	Regressor
Response	Stimulus
Endogenous	Exogenous
Outcome	Covariate
Controlled variable	Control variable

Population and Sample

A population is the entire group of individuals or entities about which an inference is to be made. For example, if a researcher wants to study the average consumption and income of all households in a city (suppose there are 60), the population consists of every household in that city.
A sample is a subset of the population that is selected to represent the entire population. For example, rather than collecting consumption data on all 60 households, a researcher randomly selects 20 households to make an inference about the average consumption and income of all 60 households.
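
Drawing a sample from a population can be sketched in a few lines of R. The population below is hypothetical: 60 simulated consumption figures with an assumed mean and spread, not the households discussed in this text.

```r
# Hypothetical population of 60 household consumption figures (simulated)
set.seed(2024)
population <- round(rnorm(60, mean = 120, sd = 35))

# Simple random sample of 20 households, drawn without replacement
sample_20 <- sample(population, size = 20)

# The sample mean is used to make an inference about the population mean
print(mean(population))
print(mean(sample_20))
```

The two means will usually differ somewhat, and a different random sample would give a different sample mean.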

Understanding Regression Through an Example

Remember that the purpose of regression analysis is to estimate or predict the population mean value of Y on the basis of known or fixed values of X, that is, to find the average consumption of the 60 households given their incomes. To understand this, we use the data on weekly consumption and income of 60 households given in the table below. These 60 households are divided into 10 income groups.

# Load necessary libraries
library(knitr)
library(kableExtra)  # needed for kable_styling(), row_spec(), column_spec()

# Create a data frame with the population data.
# NA marks income groups that have fewer than seven families.
# Cons_Fam_6 and Cons_Fam_7 are aligned so that every row sums to Total.
data <- data.frame(
  Income = c(80, 100, 120, 140, 160, 180, 200, 220, 240, 260),
  Cons_Fam_1 = c(55, 65, 79, 80, 102, 110, 120, 135, 137, 150),
  Cons_Fam_2 = c(60, 70, 84, 93, 107, 115, 136, 137, 145, 152),
  Cons_Fam_3 = c(65, 74, 90, 95, 110, 120, 140, 140, 155, 175),
  Cons_Fam_4 = c(70, 80, 94, 103, 116, 130, 144, 152, 165, 178),
  Cons_Fam_5 = c(75, 85, 98, 108, 118, 135, 145, 157, 175, 180),
  Cons_Fam_6 = c(NA, 88, NA, 113, 125, 140, NA, 160, 189, 185),
  Cons_Fam_7 = c(NA, NA, NA, 115, NA, NA, NA, 162, NA, 191),
  Total = c(325, 462, 445, 707, 678, 750, 685, 1043, 966, 1211),
  Conditional_Mean = c(65, 77, 89, 101, 113, 125, 137, 149, 161, 173)
)

# Print the table with formatting
kable(data, caption = "Weekly Family Income and Consumption Expenditure") %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(0, bold = TRUE) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(10, bold = TRUE)

Weekly Family Income and Consumption Expenditure
Income	Cons_Fam_1	Cons_Fam_2	Cons_Fam_3	Cons_Fam_4	Cons_Fam_5	Cons_Fam_6	Cons_Fam_7	Total	Conditional_Mean
80	55	60	65	70	75	NA	NA	325	65
100	65	70	74	80	85	88	NA	462	77
120	79	84	90	94	98	NA	NA	445	89
140	80	93	95	103	108	113	115	707	101
160	102	107	110	116	118	125	NA	678	113
180	110	115	120	130	135	140	NA	750	125
200	120	136	140	144	145	NA	NA	685	137
220	135	137	140	152	157	160	162	1043	149
240	137	145	155	165	175	189	NA	966	161
260	150	152	175	178	180	185	191	1211	173

The data in the table represent the whole population, where Y is weekly consumption expenditure and X is weekly income. Economic theory suggests that consumption expenditure increases with income. The table shows considerable variation in weekly consumption expenditure within each income group. For example, the five households with a weekly income of $80 have weekly consumption ranging from $55 to $75, and the six households with a weekly income of $100 have weekly consumption ranging from $65 to $88. Consumption also varies considerably across groups, but on average weekly consumption expenditure increases as income increases (see the Conditional_Mean column). In other words, households with higher income have, on average, higher consumption. For example, the average weekly consumption of households whose income is $80 is $65, while that of households whose income is $160 is $113.
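
The conditional means, and the unconditional mean they are contrasted with, can be recomputed from the household figures. The following is a minimal sketch in R. The object name `cons_by_income` is made up for illustration, and the households are grouped so that each income group reproduces the Total and Conditional_Mean columns of the table.

```r
# Weekly consumption of the 60 households, grouped by weekly income.
# Grouping follows the Total and Conditional_Mean columns of the table.
cons_by_income <- list(
  "80"  = c(55, 60, 65, 70, 75),
  "100" = c(65, 70, 74, 80, 85, 88),
  "120" = c(79, 84, 90, 94, 98),
  "140" = c(80, 93, 95, 103, 108, 113, 115),
  "160" = c(102, 107, 110, 116, 118, 125),
  "180" = c(110, 115, 120, 130, 135, 140),
  "200" = c(120, 136, 140, 144, 145),
  "220" = c(135, 137, 140, 152, 157, 160, 162),
  "240" = c(137, 145, 155, 165, 175, 189),
  "260" = c(150, 152, 175, 178, 180, 185, 191)
)

# Conditional means E(Y | X): one average per income group
cond_mean <- sapply(cons_by_income, mean)

# Unconditional mean E(Y): average over all 60 households, ignoring income
uncond_mean <- mean(unlist(cons_by_income))

print(cond_mean)
print(uncond_mean)
```

The conditional means rise from 65 to 173 as income rises from 80 to 260, while the unconditional mean is a single number, 121.2, that ignores income.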

Population Regression Line

Figure 1 shows the expected values of weekly consumption Y at various levels of income. The circles show the mean value of Y (consumption) at each value of X (income). Remember that at each income level, say $80, consumption can take any value within its probability distribution, as shown by the dark lines in Figure 1. That is why the dependent variable is random. In other words, households with an income of $80 do not all have consumption expenditure of exactly $65; individual consumption can be above or below $65, but the average is $65. The purpose of regression analysis is to predict or estimate this average value of Y, given various values of X.

Population Regression Function

The population regression function (PRF) expresses the conditional expected value of the dependent variable, E(Y|Xi), as a function of the known or fixed values of the independent variable Xi. It is also called the conditional expectation function (CEF).
\[E(Y|X_i)=f(X_i)\]

Assuming that the consumption function is linear, we can write the population regression function as:

\[E(Y|X_i)=β_0+β_1 X_i\]

where β0 and β1 are unknown but fixed parameters known as the regression coefficients; β0 and β1 are also known as the intercept and slope coefficients, respectively.
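
For the population in the table above, the conditional means rise by 12 for every increase of 20 in income, so they lie exactly on a straight line. The following is a minimal sketch in R that recovers the intercept and slope of that line from the Income and Conditional_Mean columns; the object names are chosen only for illustration.

```r
# Income levels and conditional means taken from the population table
income <- seq(80, 260, by = 20)
cond_mean <- c(65, 77, 89, 101, 113, 125, 137, 149, 161, 173)

# Fit a straight line through the ten conditional means
prf_fit <- lm(cond_mean ~ income)
prf_coef <- coef(prf_fit)
print(prf_coef)

# Largest vertical gap between a conditional mean and the fitted line
max_gap <- max(abs(cond_mean - fitted(prf_fit)))
print(max_gap)
```

The fitted line is E(Y|Xi) = 17 + 0.6 Xi, and the gap is zero up to rounding error, which is what a linear PRF means for this population.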

Population Regression Line

The population regression line, or population regression curve, is the locus of the conditional means of the dependent variable Y for the fixed or given values of the independent variable. More simply, it is the curve connecting the means of the subpopulations of Y corresponding to the given values of the regressor X.

Stochastic specification of PRF

Remember that an econometric model is a set of behavioral equations that represent relationships between economic variables. It consists of some observed variables and some unobserved variables. The observed variables are those included in the model, often called independent or explanatory variables, which explain the variation in Y, the dependent variable. In our consumption function, for example, consumption is the dependent variable whose variation we try to explain, and income is the independent variable that explains the variation in consumption. Income, however, is not the only factor that affects consumption. Figure 1 shows that, at each income level, the consumption of individual families is clustered around the mean consumption.
The deviation of an individual family's consumption from the mean can be measured by the vertical distance between the individual family consumption, Yi, and the population regression line (the solid straight line), which shows the conditional mean consumption E(Y|Xi). Thus, \[u_i=Y_i-E(Y|X_i)\]
\[Y_i=E(Y|X_i)+u_i\]
Substituting E(Y|X_i) = β_0+β_1 X_i into the last equation gives \[Y_i=β_0+β_1 X_i+u_i\]
Here u_i is the deviation of each family's consumption from the conditional mean consumption. It can be positive or negative. Technically, it is called the stochastic error term or disturbance term. This equation says that the consumption expenditure of each family is linearly related to its income plus a disturbance term. In other words, there are family-specific factors that affect the consumption of each family. Family size is one example: a large family may have higher consumption than a small family even though its income is lower. Even if we include family size in the model, some variation remains that is not explained by income and family size. There are many factors that we cannot include in the model, either because our knowledge of them is limited or because we cannot measure them. Even if we could include all relevant factors, there might be measurement errors in the observed variables themselves. For example, hardly anyone reports his or her personal income exactly. Moreover, the purpose of regression analysis is to estimate the average consumption behavior of all families, not the consumption of each family. Thus, we include a random error term in our population regression function as a proxy for all the omitted or neglected factors, other than X, that affect Y.
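
The disturbance term can be computed directly for one income group. The following is a minimal sketch in R, using the five households with a weekly income of $80 from the population table; the object names are chosen only for illustration.

```r
# Weekly consumption of the five households with a weekly income of 80
y_80 <- c(55, 60, 65, 70, 75)

# Conditional mean E(Y | X = 80)
cond_mean_80 <- mean(y_80)

# Disturbance of each household: u_i = Y_i - E(Y | X_i)
u_80 <- y_80 - cond_mean_80

print(u_80)        # some deviations are negative, some positive
print(sum(u_80))   # deviations from the conditional mean sum to zero
```

Each household's consumption equals the conditional mean of 65 plus its own disturbance, which is the decomposition Yi = E(Y|Xi) + ui.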

Consumption of 10 income groups

Sample Regression Function

Studying the whole population is difficult because it consumes time, energy, and resources. That is why, instead of studying the whole population, we take a sample that is representative of it. In regression, our task is therefore to estimate the population regression function (PRF) on the basis of the sample regression function (SRF). We can draw many (“N”) random samples from a given population, and the samples are unlikely to be identical. The PRF gives the average weekly consumption for given or fixed values of income. With the SRF, our task is to estimate the average weekly consumption Y in the population as a whole corresponding to the chosen X. Remember that we cannot estimate the PRF exactly from an SRF because of sampling fluctuations. Because we can draw N samples from a given population, there will be N SRFs, each providing a different estimate of the population parameters. Suppose that we take two random samples from the population of 60 families and draw the sample regression line for each, SRL1 and SRL2. Which SRL is the true representative of the PRL? There is no way we can be absolutely sure that either of the regression lines shown in Figure 3 represents the true population regression line.

Random samples of 60 families

Sample regression lines

Sample regression function can be written as \[\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1}X_i\]

where
\(\hat{Y_i}\) (read “Y-hat” or “Y-cap”) = estimator of E(Y | Xi)
\(\hat{\beta_0}\) = estimator of β0
\(\hat{\beta_1}\) = estimator of β1

We can also write the SRF in another form:
\[Y_i = \hat{\beta_0} + \hat{\beta_1}X_i + \hat{u_i}\]
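
Sampling fluctuation can be illustrated with the 60 households of the population table. The following is a minimal sketch in R. The sample size of 20, the random seed, and the object names are assumptions, and the households are grouped so that each income group reproduces the Total and Conditional_Mean columns of the table.

```r
# Population of 60 households, grouped by weekly income
cons_by_income <- list(
  "80"  = c(55, 60, 65, 70, 75),
  "100" = c(65, 70, 74, 80, 85, 88),
  "120" = c(79, 84, 90, 94, 98),
  "140" = c(80, 93, 95, 103, 108, 113, 115),
  "160" = c(102, 107, 110, 116, 118, 125),
  "180" = c(110, 115, 120, 130, 135, 140),
  "200" = c(120, 136, 140, 144, 145),
  "220" = c(135, 137, 140, 152, 157, 160, 162),
  "240" = c(137, 145, 155, 165, 175, 189),
  "260" = c(150, 152, 175, 178, 180, 185, 191)
)
population <- data.frame(
  income = rep(as.numeric(names(cons_by_income)), times = lengths(cons_by_income)),
  consumption = unlist(cons_by_income, use.names = FALSE)
)

# Draw two random samples of 20 households each (size and seed are assumed)
set.seed(10)
s1 <- population[sample(nrow(population), 20), ]
s2 <- population[sample(nrow(population), 20), ]

# One SRF per sample, and the line fitted to the whole population
srf1 <- coef(lm(consumption ~ income, data = s1))
srf2 <- coef(lm(consumption ~ income, data = s2))
prf  <- coef(lm(consumption ~ income, data = population))

print(rbind(SRF1 = srf1, SRF2 = srf2, PRF = prf))
```

The two samples give different intercept and slope estimates, and in practice only one such sample is observed, so the population line is not known.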

\(\hat{u_i}\) is considered an estimate of u_i.

Estimator and Estimate

An estimator is a rule, formula, or method that tells us how to estimate a population parameter from the information provided by the sample. A particular numerical value obtained by applying the estimator is called an estimate. An estimator is random, whereas an estimate is nonrandom.

Looking forward

To sum up, our primary objective in regression analysis is to estimate the PRF
\[Y_i=β_0+β_1 X_i+u_i\]
on the basis of the SRF
\[Y_i = \hat{\beta_0} + \hat{\beta_1}X_i + \hat{u_i}\]

We know that there can be as many SRFs as there are random samples, so the question is which SRF best approximates the PRF. Can we devise a rule or method that makes this approximation as “close” as possible? In other words, how should the SRF be constructed so that \(\hat{\beta_0}\) is as “close” as possible to the true β0 and \(\hat{\beta_1}\) is as “close” as possible to the true β1, even though we will never know the true β0 and β1? The answer is yes: the method is called Ordinary Least Squares (OLS), and it chooses the estimates that minimize the sum of squared residuals.
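
As a rough illustration of what least squares does, the sketch below uses a hypothetical sample of ten households, one hand-picked from each income group of the population table (the selection is an assumption made for illustration). It computes the slope and intercept from the standard closed-form least-squares expressions for the two-variable model, which are not derived in this text, and checks them against `lm()`.

```r
# Hypothetical sample: one household picked from each income group
income <- seq(80, 260, by = 20)
consumption <- c(70, 65, 90, 95, 110, 115, 120, 140, 155, 150)

# Closed-form least-squares estimates for the two-variable model
x_dev <- income - mean(income)
y_dev <- consumption - mean(consumption)
b1_hat <- sum(x_dev * y_dev) / sum(x_dev^2)
b0_hat <- mean(consumption) - b1_hat * mean(income)

# The same estimates obtained with lm()
fit <- lm(consumption ~ income)
print(c(b0_hat = b0_hat, b1_hat = b1_hat))
print(coef(fit))

# Sum of squared residuals for any candidate line b0 + b1 * X
ssr <- function(b0, b1) sum((consumption - b0 - b1 * income)^2)
ssr_ols <- ssr(b0_hat, b1_hat)
ssr_other <- ssr(17, 0.6)   # line through the population conditional means
print(c(ssr_ols = ssr_ols, ssr_other = ssr_other))
```

For this sample the least-squares line has a smaller sum of squared residuals than the line through the population conditional means, because least squares picks the line that fits the sample at hand best.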