2024-09-22

Regression Analysis

Regression analysis is a statistical method for modeling the relationship between two or more variables so that one variable can be predicted or explained using information from the others.

This technique is used to quantify the relationship between one or more predictor variables and one response variable. In many applications, regression is used to find a best fit model that describes associations and predictive relationships between several factors. A fitted model on its own shows association; a cause/effect reading needs support from the study design.

Simple Linear Regression

When a single predictor is used to explain a single outcome, the form of regression analysis used is simple linear regression.

Linear regression fits a straight line to represent a trend in the data. In other words, it calculates the best fit intercept and slope when the dependent (outcome) variable and the independent (predictor) variable are plotted on an \(xy\) coordinate plane.

The mathematical formula for simple linear regression is \(y = \beta_0 + \beta_1\cdot x + \varepsilon\), where \(\beta_0\) and \(\beta_1\) are the \(y\)-intercept and slope of the best fit line, respectively, and are referred to as the regression coefficients. The error term \(\varepsilon\) captures the variation in \(y\) that the line does not explain; the coefficients are chosen to minimize the sum of the squared residuals (ordinary least squares).
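For a single predictor, the least-squares estimates have a closed form: \(\hat\beta_1 = \mathrm{cov}(x, y)/\mathrm{var}(x)\) and \(\hat\beta_0 = \bar{y} - \hat\beta_1\cdot\bar{x}\). The sketch below computes them by hand and cross-checks the result against `lm()`. It assumes that `stateData` can be rebuilt from R's built-in `state.x77` matrix; the column name `HSGrad` is a renaming chosen here to match the formulas in these slides.

```r
# Assumption: stateData is derived from the built-in state.x77 matrix,
# with the "HS Grad" column renamed to HSGrad.
stateData <- data.frame(
  Income = state.x77[, "Income"],
  HSGrad = state.x77[, "HS Grad"]
)

x <- stateData$HSGrad
y <- stateData$Income

# Closed-form least-squares estimates for one predictor
beta1 <- cov(x, y) / var(x)         # slope
beta0 <- mean(y) - beta1 * mean(x)  # intercept

# Cross-check against R's built-in linear model fit
fit <- lm(Income ~ HSGrad, data = stateData)
print(c(beta0 = beta0, beta1 = beta1))
print(coef(fit))
```

Both printed vectors should agree up to floating-point error, since `lm()` solves the same least-squares problem.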

Simple Linear Regression Example

For example, we can use the state dataset to fit a simple linear regression of per capita income (1974) on the percentage of high-school graduates (1970) across the 50 states:

lm(Income ~ HSGrad, data = stateData)
## 
## Call:
## lm(formula = Income ~ HSGrad, data = stateData)
## 
## Coefficients:
## (Intercept)       HSGrad  
##     1931.10        47.16

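Based on simple linear regression, the fitted line \(\hat{y} = \hat\beta_0 + \hat\beta_1\cdot x\) that best represents the relationship between income and high-school graduation can be expressed as \(\hat{y} = 1931.10 + 47.16\cdot x\), where \(\hat{y}\) and \(x\) are predicted income per capita and percent high-school graduates, respectively. In other words, each additional percentage point of high-school graduates is associated with about 47.16 dollars more in per capita income.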
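As a worked example of using the fitted equation, the sketch below predicts income for a hypothetical state in which 60% of residents are high-school graduates, once by hand with the rounded coefficients and once with `predict()`. The 60% figure and the name `newState` are illustrative only, and `stateData` is again assumed to come from the built-in `state.x77` matrix.

```r
# Assumption: stateData is derived from the built-in state.x77 matrix.
stateData <- data.frame(
  Income = state.x77[, "Income"],
  HSGrad = state.x77[, "HS Grad"]
)
fit <- lm(Income ~ HSGrad, data = stateData)

# Hypothetical state with 60% high-school graduates (illustrative value)
newState <- data.frame(HSGrad = 60)

# By hand, using the rounded coefficients reported in the output
byHand <- 1931.10 + 47.16 * 60

# Using the unrounded fitted model
byModel <- predict(fit, newdata = newState)

print(byHand)
print(unname(byModel))
```

The two values differ slightly because the hand calculation uses coefficients rounded to two decimals.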

Simple Linear Regression Example Cont.

Having determined the regression equation, the best fit line can be overlaid onto a scatter plot of the response variable (income per capita) against the predictor variable (percentage of high-school graduates).
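One way to draw such an overlay is with base R graphics, as sketched below. The original figure may have been produced with a different plotting package; the axis labels, colour, and the construction of `stateData` from `state.x77` are assumptions made for this example.

```r
# Assumption: stateData is derived from the built-in state.x77 matrix.
stateData <- data.frame(
  Income = state.x77[, "Income"],
  HSGrad = state.x77[, "HS Grad"]
)
fit <- lm(Income ~ HSGrad, data = stateData)

# Scatter plot of the response against the predictor
plot(Income ~ HSGrad, data = stateData,
     xlab = "High-school graduates (%, 1970)",
     ylab = "Per capita income (1974)",
     main = "Income vs. high-school graduation")

# Overlay the fitted regression line using the model's intercept and slope
abline(fit, col = "red", lwd = 2)
```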

Multiple Linear Regression

When there is more than one predictor, multiple linear regression extends simple linear regression to predict an outcome variable (\(y\)) on the basis of several distinct predictor variables (\(x_1, x_2, x_3, \ldots\)).

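Mathematically, the association between two predictor variables and the outcome is expressed by the equation \(y = \beta_0 + \beta_1\cdot x_1 + \beta_2\cdot x_2 + \varepsilon\), where each coefficient \(\beta_i\) describes the change in \(y\) per unit change in \(x_i\) with the other predictor held fixed, and \(\varepsilon\) is the error term.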
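With several predictors, the least-squares coefficients solve the normal equations \((X^\top X)\,\hat\beta = X^\top y\), where \(X\) has a column of ones followed by one column per predictor. The sketch below solves them directly and compares the result with `lm()`. It assumes `stateData` is built from the built-in `state.x77` matrix, and the names `X` and `betaHat` are chosen for this example.

```r
# Assumption: stateData is derived from the built-in state.x77 matrix.
stateData <- data.frame(
  Income = state.x77[, "Income"],
  HSGrad = state.x77[, "HS Grad"],
  Murder = state.x77[, "Murder"]
)

# Design matrix: intercept column plus the two predictors
X <- cbind(Intercept = 1,
           HSGrad = stateData$HSGrad,
           Murder = stateData$Murder)
y <- stateData$Income

# Solve the normal equations (X'X) beta = X'y
betaHat <- solve(t(X) %*% X, t(X) %*% y)

# Cross-check against lm()
fit <- lm(Income ~ HSGrad + Murder, data = stateData)
print(as.vector(betaHat))
print(unname(coef(fit)))
```

In practice `lm()` uses a QR decomposition rather than the normal equations, which is numerically more stable; the direct solve is shown only to make the formula concrete.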

Multiple Linear Regression Example

Using the state dataset, we can employ multiple linear regression to fit a model describing how two factors - ‘percent high-school graduates (1970)’ and ‘murder and non-negligent manslaughter rate per 100,000 population (1976)’ - can predict the per capita income (1974) of each state. With two predictors, the best fit is a plane rather than a single straight line.

## 
## Call:
## lm(formula = Income ~ HSGrad + Murder, data = stateData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1087.7  -294.7   -30.2   184.7  1128.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1626.92     624.73   2.604   0.0123 *  
## HSGrad         50.69       9.92   5.110 5.79e-06 ***
## Murder         15.82      21.70   0.729   0.4696    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 489.5 on 47 degrees of freedom
## Multiple R-squared:  0.3912, Adjusted R-squared:  0.3653 
## F-statistic:  15.1 on 2 and 47 DF,  p-value: 8.612e-06
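The summary table above can also be inspected programmatically, which helps when deciding which predictors are statistically distinguishable from zero. The sketch below refits the model, pulls out the p-values, and computes 95% confidence intervals. The 0.05 threshold is a conventional choice rather than something fixed by the analysis, and `stateData` is assumed to come from the built-in `state.x77` matrix.

```r
# Assumption: stateData is derived from the built-in state.x77 matrix.
stateData <- data.frame(
  Income = state.x77[, "Income"],
  HSGrad = state.x77[, "HS Grad"],
  Murder = state.x77[, "Murder"]
)
fit <- lm(Income ~ HSGrad + Murder, data = stateData)

# Coefficient table: estimate, standard error, t value, p-value
coefTable <- coef(summary(fit))
pValues <- coefTable[, "Pr(>|t|)"]

# Flag coefficients with p < 0.05 (conventional threshold, an assumption)
significant <- pValues < 0.05
print(significant)

# 95% confidence intervals and the share of variance explained
print(confint(fit, level = 0.95))
rSquared <- summary(fit)$r.squared
print(rSquared)
```

A confidence interval that contains zero, as expected here for `Murder` given its p-value of 0.4696, indicates that the data do not pin down the sign of that coefficient.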

Multiple Linear Regression Example Cont.

To analyze how multiple predictors collectively predict one outcome, each independent (predictor) variable is often plotted against the dependent (outcome) variable to give a visual representation of the individual linear relationships.
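A minimal sketch of such per-predictor plots, using base R graphics with one panel per predictor and a single-predictor trend line in each, might look like the following. The panel layout, labels, and the construction of `stateData` from `state.x77` are assumptions; the original figures may have been drawn differently.

```r
# Assumption: stateData is derived from the built-in state.x77 matrix.
stateData <- data.frame(
  Income = state.x77[, "Income"],
  HSGrad = state.x77[, "HS Grad"],
  Murder = state.x77[, "Murder"]
)

# Single-predictor fits, used only to draw a trend line in each panel
fitHS <- lm(Income ~ HSGrad, data = stateData)
fitMurder <- lm(Income ~ Murder, data = stateData)

# Two panels side by side; restore the previous layout afterwards
oldPar <- par(mfrow = c(1, 2))
plot(Income ~ HSGrad, data = stateData,
     xlab = "High-school graduates (%)", ylab = "Per capita income")
abline(fitHS, col = "red")
plot(Income ~ Murder, data = stateData,
     xlab = "Murder rate per 100,000", ylab = "Per capita income")
abline(fitMurder, col = "red")
par(oldPar)
```

Each trend line here comes from a one-predictor model, so its slope can differ from the corresponding coefficient in the two-predictor model.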

Multiple Linear Regression Example Cont.

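Here, the multiple linear regression refines the general trends: based on this data from the 1970s, states with a higher percentage of high-school graduates tended to have a higher income per capita (about 50.69 dollars per percentage point, p < 0.001). The murder rate tells a less clear story: its coefficient is small, positive (15.82), and not statistically significant (p = 0.47), so once education is accounted for, the model does not support the conclusion that states reporting less murder were wealthier.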

Beyond Regression: Visual Analysis of America in the 70s

Using regression, statisticians in many fields can make predictions or draw conclusions from a wide range of data. In the case of the state dataset and other geopolitical data, regression models are used in conjunction with additional graphical representations, like the map below, to create visuals that appeal to the general public on the news or in other media.

For More Information