9/9/2021

Author: Breanna Seitz

What is Linear Regression?

Linear Regression models the relationship between two or more variables in a data set, much as a line on a graph shows how y changes with x. It is a common algorithm for statistical modeling because its results are easy to interpret.

Linear Regression assumes a linear relationship between the variables. R's built-in women dataset illustrates this: when weight is plotted against height, the points fall close to a straight line.

Simple Linear Regression Example

library(ggplot2)
# Scatter plot of weight against height from the built-in women dataset
g <- ggplot(women, aes(height, weight)) + geom_point()
g

Coefficients and Residuals

Linear Regression modeling creates a ‘line of best fit’ of the form y = A + Bx. The coefficient A represents the y-intercept, and B represents the slope. Plugging an input x into this equation gives the model's predicted outcome y.
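As a minimal sketch of where A and B come from, the model can be fitted with base R's `lm()` and the coefficients read off with `coef()`. The variable names (`fit`, `A`, `B`) and the example height of 65 are illustrative choices, not part of the original slides.

```r
# Fit weight as a linear function of height (built-in women dataset)
fit <- lm(weight ~ height, data = women)

# coef() returns a named vector: "(Intercept)" is A, "height" is B
A <- coef(fit)[["(Intercept)"]]
B <- coef(fit)[["height"]]

# Predict the weight for an illustrative height of 65 by hand: y = A + Bx
manual_prediction <- A + B * 65

# predict() applies the same equation to new data
model_prediction <- predict(fit, newdata = data.frame(height = 65))
```

Both `manual_prediction` and `model_prediction` hold the same value, since `predict()` simply evaluates the fitted line at the new input.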

To assess how well the model fits, we can look at the residuals. A residual is the difference between an observed value and the value the model predicts for it, that is, the vertical distance from each real point to the line.

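To add the line of best fit to the plot in R:

# method = "lm" draws the fitted linear regression line over the points
g <- g + geom_smooth(method = "lm")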
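The residuals described in this section can also be computed directly. The sketch below assumes the same `weight ~ height` model; the variable names are illustrative.

```r
# Fit the same model used for the line of best fit
fit <- lm(weight ~ height, data = women)

# Predicted weight for each observed height
predicted <- fitted(fit)

# Residual = observed value minus predicted value
res_manual <- women$weight - predicted

# residuals() returns the same quantities directly
res_builtin <- residuals(fit)

# Five-number summary plus mean of the residuals
summary(res_builtin)
```

For a least-squares fit with an intercept, the residuals sum to zero (up to floating-point error), so their spread around zero is what matters rather than their average.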

Coefficients and Residuals Cont.

Here is an example using the same women dataset.

## `geom_smooth()` using formula 'y ~ x'

Is it a good fit?

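To see whether a Linear Regression model is a good fit, there are a few things we can look at. First, we can look at the p-value. The smaller it is, the less likely it is that a relationship this strong would appear by chance if the variables were actually unrelated. A p-value below 0.05 is conventionally treated as statistically significant.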

Secondly, we can look at how small the residuals are. The closer these values are to zero, the closer the model's predictions are to the observed values.

Lastly, we can look at the R-squared value, which is the proportion of the variance in the outcome that the model explains. The closer this value is to 1, the better the fit.
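These three quantities can be pulled out of the object returned by `summary()` on a fitted model. The sketch below assumes the `weight ~ height` model on the women dataset; the variable names are illustrative, and the manual R-squared calculation is included only to show what the number means.

```r
fit <- lm(weight ~ height, data = women)
fit_summary <- summary(fit)

# p-value for each coefficient (column "Pr(>|t|)" of the coefficient table)
p_values <- fit_summary$coefficients[, "Pr(>|t|)"]

# Residual standard error: the typical size of a residual
residual_se <- fit_summary$sigma

# R-squared as reported by summary()
r_squared <- fit_summary$r.squared

# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((women$weight - mean(women$weight))^2)
r_squared_manual <- 1 - ss_res / ss_tot
```

`r_squared` and `r_squared_manual` agree, which confirms that R-squared measures how much of the variation in weight is accounted for by the fitted line.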

Is it a good fit? Example

## 
## Call:
## lm(formula = weight ~ height, data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

We can see that the residuals range from -1.73 to 3.12, with a residual standard error of 1.525. These values are all close to zero, which suggests the model fits the data well.

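We can also see the Multiple R-squared value is 0.991, which is close to 1, also meaning this is a good fit. The p-value of 1.091e-14 is far below 0.05, so the relationship between height and weight is statistically significant.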

Simple Linear Regression using Plotly

Here is an interactive plot showing the actual values, the predicted values, and the residuals for each point.
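A minimal sketch of how such a plot could be built with the plotly package is shown below. It assumes plotly is installed and is not necessarily the code used for the original slide; the data frame `plot_data`, its column names, and the trace names are illustrative.

```r
library(plotly)

fit <- lm(weight ~ height, data = women)

# Collect actual values, predicted values, and residuals in one data frame
plot_data <- data.frame(
  height    = women$height,
  weight    = women$weight,
  predicted = unname(fitted(fit))
)
plot_data$residual <- plot_data$weight - plot_data$predicted

p <- plot_ly(plot_data) %>%
  # Vertical segments from each actual point to the fitted line (the residuals)
  add_segments(x = ~height, xend = ~height, y = ~weight, yend = ~predicted,
               name = "Residual") %>%
  # Fitted regression line
  add_lines(x = ~height, y = ~predicted, name = "Predicted") %>%
  # Observed data points
  add_markers(x = ~height, y = ~weight, name = "Actual")

p
```

Hovering over a point or segment in the rendered plot shows its coordinates, which makes it easy to compare each actual weight with its predicted value.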

Conclusion

We can clearly see that Simple Linear Regression fits this data set very well. However, not all variables have a linear relationship, and there are many different ways to modify a linear regression model to better fit your data.

Source: https://www.datacamp.com/community/tutorials/linear-regression-R