10/31/2020

Regression in Statistics

Regression describes one variable as a mathematical function of another.

\[\displaystyle Y=f(X) \]

Y is the dependent variable
X is the independent variable

Linear regression describes the relationship with the following function:

\[\displaystyle Y= {\alpha}+{\beta}X \]

\(\displaystyle {\alpha}\) is the Y intercept
\(\displaystyle {\beta}\) is the slope, or regression coefficient

Dataset USArrests

As an example data set, we will use R's built-in USArrests, shown below.

data("USArrests")
attach(USArrests)
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

Simple linear regression using Plotly

“Simple” means only two variables are considered: one dependent and one independent. For example, below is a two-variable scatterplot of Rape vs Murder arrest rates (with Urban Population as the colormap).
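As a minimal sketch of what a simple linear regression on these two variables could look like, the snippet below fits a line with base R's `lm()`. Treating Murder as the dependent variable and Rape as the independent variable is an assumption taken from the axes of the scatterplot, and the name `fit_simple` is purely illustrative.

```r
# Assumption: Murder is Y (dependent) and Rape is X (independent),
# mirroring the axes of the 2D scatterplot.
data("USArrests")

# Fit Murder = alpha + beta * Rape by least squares
fit_simple <- lm(Murder ~ Rape, data = USArrests)

# Estimated intercept (alpha) and slope (beta)
coef(fit_simple)

# Predicted murder arrest rate for a hypothetical rape arrest rate of 20
predict(fit_simple, newdata = data.frame(Rape = 20))
```

The first coefficient corresponds to the intercept and the second to the slope of the fitted line.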

Multiple Regression using Plotly

We can make a more complex 3D scatter plot using Plotly. For example, here’s a scatter plot of Murder vs Rape vs Assault arrest rates (with Urban Population as the colormap).
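The 3D plot only visualizes the three variables. A multiple regression on them might be sketched as follows, assuming Murder is the dependent variable and Assault and Rape are the predictors (this choice, and the name `fit_multi`, are illustrative assumptions rather than something the slides specify).

```r
# Assumption: Murder is the response; Assault and Rape are predictors.
data("USArrests")

# Fit Murder = alpha + beta1 * Assault + beta2 * Rape by least squares
fit_multi <- lm(Murder ~ Assault + Rape, data = USArrests)

# One intercept plus one slope per predictor
coef(fit_multi)
```

Geometrically, this fit corresponds to a plane through the 3D point cloud rather than a line.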

Code for 2D and 3D plots in last 2 slides

Here’s the code for the 2D scatter plot:

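library(plotly)  # provides plot_ly(), colorbar(), and the %>% pipe
# Rape on x, Murder on y; marker color mapped to UrbanPop
plot_ly(USArrests, x = ~Rape, y = ~Murder, type = "scatter",
        mode = "markers", color = ~UrbanPop) %>%
  colorbar(title = "UrbanPop")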

And here’s the 3D scatter plot code:

plot_ly(USArrests, x = ~Assault, y = ~Murder, z = ~Rape,
        type = "scatter3d", mode = "markers",
        color = ~UrbanPop)

In both plots, Urban Population Percentage is added as a colormap.

Regression line criterion and estimation

The best fitting line is found using the least squares criterion, where the residual sum of squares is minimized.

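\[ \displaystyle SS_{residual}= \sum_{i=1}^n{(Y_i-\hat{Y}_i)^2} \]

Here \(\hat{Y}_i = a + bX_i\) is the value predicted by the fitted line. To estimate the regression parameters \(\alpha\) and \(\beta\), we use the following two equations, where \(a\) and \(b\) are the least squares estimates of \(\alpha\) and \(\beta\):

\[ \displaystyle b= {\sum_{i=1}^n{(X_i-\bar{X})(Y_i-\bar{Y})} \over \sum_{i=1}^n{(X_i-\bar{X})^2 }} \]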

\[ \displaystyle a= \bar{Y}-b\bar{X}\ \]
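The two estimation equations can be checked by hand in R. The sketch below assumes X is Assault and Y is Murder (an illustrative choice) and compares the manual estimates with the result of `lm()`; the variable names are hypothetical.

```r
# Assumption: X = Assault (independent), Y = Murder (dependent).
data("USArrests")
x <- USArrests$Assault
y <- USArrests$Murder

# Slope: sum of cross-deviations over sum of squared X deviations
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: the fitted line passes through (mean(x), mean(y))
a <- mean(y) - b * mean(x)

# Residual sum of squares for the fitted line
ss_residual <- sum((y - (a + b * x))^2)

# Cross-check against R's built-in least squares fit
fit <- lm(Murder ~ Assault, data = USArrests)
all.equal(c(a, b), unname(coef(fit)))
all.equal(ss_residual, sum(resid(fit)^2))
```

Both `all.equal()` comparisons should return `TRUE`, since `lm()` minimizes the same residual sum of squares.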

ggplot Scatterplot With Regression Line

Now we can plot a regression line on a scatter plot of Murder vs Assault in ggplot2 using stat_smooth(method = "lm").
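The slide's own plotting code is not shown, so the following is only a sketch of how such a plot might be produced. It assumes Assault goes on the x axis and Murder on the y axis, and the object name `p` is illustrative.

```r
library(ggplot2)
data("USArrests")

# Assumption: Assault on x, Murder on y
p <- ggplot(USArrests, aes(x = Assault, y = Murder)) +
  geom_point() +                               # raw data points
  stat_smooth(method = "lm", formula = y ~ x)  # least squares line with confidence band

p  # printing the object draws the plot
```

Passing `se = FALSE` to `stat_smooth()` would drop the shaded confidence band and leave only the line.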

Scatter plot With Row Names

To visualize the data better, we can again plot Rape vs Murder rates, this time with state names in place of points, as seen below.
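One possible way to draw such a labeled plot with ggplot2 is sketched below. The state names live in the row names of USArrests, so the sketch first copies them into a column. The column name `State`, the object names, and the axis assignment (Rape on x, Murder on y, as in the 2D Plotly code) are assumptions for illustration.

```r
library(ggplot2)
data("USArrests")

# Move the row names (state names) into a regular column so aes() can use them
arrests <- data.frame(State = rownames(USArrests), USArrests)

# Assumption: Rape on x, Murder on y; text labels replace the point markers
p_labels <- ggplot(arrests, aes(x = Rape, y = Murder, label = State)) +
  geom_text(size = 3)

p_labels  # printing the object draws the plot
```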