2024-10-29

What is Linear Regression

  • A way of building a model that estimates a dependent variable from an independent variable.
  • The model can then be used to predict the dependent variable from the independent variable.
  • A model with a single independent variable is called simple linear regression. A model with two or more independent variables is called multiple linear regression. (The term multivariate regression is usually reserved for models with more than one dependent variable.)
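As a rough illustration of the difference between the two kinds of model, the sketch below fits one of each. It assumes that `aqc` is R's built-in `airquality` data with incomplete rows removed, and the choice of `Wind` as a second independent variable is only for illustration.

```r
# Assumption: aqc is the built-in airquality data with incomplete rows removed.
aqc <- na.omit(airquality)

# Simple linear regression: one independent variable.
mod_simple <- lm(Temp ~ Solar.R, data = aqc)

# Multiple linear regression: two or more independent variables.
# Wind is picked here purely as an illustrative second variable.
mod_multiple <- lm(Temp ~ Solar.R + Wind, data = aqc)

# Each model has one intercept plus one coefficient per independent variable.
print(coef(mod_simple))
print(coef(mod_multiple))
```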

Simple Linear Regression

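# Fit Temp as a linear function of Solar.R.
# aqc is the data frame used throughout; its definition is not shown here.
mod <- lm(Temp ~ Solar.R, data = aqc)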
summary(mod)
## 
## Call:
## lm(formula = Temp ~ Solar.R, data = aqc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.735  -6.292   1.080   6.231  18.648 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 72.110720   1.970502  36.595  < 2e-16 ***
## Solar.R      0.030747   0.009571   3.212  0.00173 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.15 on 109 degrees of freedom
## Multiple R-squared:  0.08649,    Adjusted R-squared:  0.07811 
## F-statistic: 10.32 on 1 and 109 DF,  p-value: 0.001731

This is the output of a simple linear regression model. The coefficients are the first thing to look at, because they define the model. Here the coefficient for ‘Solar.R’ is \(0.030747\) and the intercept is \(72.110720\), so the fitted model is: \(\widehat{Temp} = 72.110720 + 0.030747*(Solar.R)\)
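To show how the fitted equation turns into predictions, here is a minimal sketch. It assumes that `aqc` is the built-in `airquality` data with incomplete rows removed (this gives 111 rows, which agrees with the 109 residual degrees of freedom in the output). The `new_days` values are hypothetical inputs chosen for illustration.

```r
# Assumption: aqc is the built-in airquality data with incomplete rows removed.
aqc <- na.omit(airquality)

mod <- lm(Temp ~ Solar.R, data = aqc)

# Extract the fitted intercept and slope by name.
b0 <- coef(mod)[["(Intercept)"]]
b1 <- coef(mod)[["Solar.R"]]

# Hypothetical solar radiation readings to predict temperature for.
new_days <- data.frame(Solar.R = c(100, 200, 300))

# predict() evaluates the fitted line at each new value of Solar.R.
pred_auto <- predict(mod, newdata = new_days)

# The same predictions, computed by hand from the model equation.
pred_manual <- b0 + b1 * new_days$Solar.R

print(cbind(new_days, pred_auto, pred_manual))
```

The two prediction columns should agree, since `predict()` is applying the same intercept and slope.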

Graph of Simple Linear Regression

This graph shows the regression line drawn over the scatter plot of Solar.R and Temp. The line follows the general trend of the data points.
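One possible way to draw a graph like this is sketched below. The original plotting code is not shown, so this is an assumption about how it could be done: it uses `geom_smooth(method = "lm")` from ggplot2 to add the least-squares line, and again assumes `aqc` is `airquality` with incomplete rows removed.

```r
library(ggplot2)

# Assumption: aqc is the built-in airquality data with incomplete rows removed.
aqc <- na.omit(airquality)

# Scatter plot of the data with the fitted regression line on top.
# se = FALSE hides the confidence band so only the line is drawn.
p <- ggplot(aqc, aes(Solar.R, Temp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p)
```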

Interpretation of Coefficients

  • Simple regression models have the form \(\hat{y} = \beta_0+\beta_1x\).
  • \(\beta_0\) is the constant (intercept) of the model, and \(\beta_1\) is the coefficient (slope) for \(x\), the independent variable.
  • To interpret \(\beta_0\), we would say “If \(x\) is \(0\), then the predicted \(y\) is \(\beta_0\)”.
  • To interpret \(\beta_1\), we would say “A one-unit increase in \(x\) is associated with a change of \(\beta_1\) in the predicted \(y\)”.
  • In a formal interpretation of \(\beta_0\) and \(\beta_1\), use the actual variable names instead of \(x\) and \(y\).
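The interpretation templates above can be filled in with the actual variable names. The sketch below does this for the Temp and Solar.R model, assuming `aqc` is `airquality` with incomplete rows removed. The wording of the sentences is only one reasonable phrasing.

```r
# Assumption: aqc is the built-in airquality data with incomplete rows removed.
aqc <- na.omit(airquality)

mod <- lm(Temp ~ Solar.R, data = aqc)
b <- coef(mod)

# Interpretation of the intercept, using the real variable names.
intercept_text <- sprintf(
  "If Solar.R is 0, then the predicted Temp is %.2f.",
  b[["(Intercept)"]]
)

# Interpretation of the slope, using the real variable names.
slope_text <- sprintf(
  "A one-unit increase in Solar.R is associated with a change of %.4f in the predicted Temp.",
  b[["Solar.R"]]
)

cat(intercept_text, slope_text, sep = "\n")
```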

Strength of Models

  • When judging how well a model works, the main things to look at are the \(r^2\) statistic and the significance of the independent variable.
  • \(r^2\) measures how well the model fits the data: it is the proportion of the variation in the dependent variable that the model explains, so it is higher when the actual data points lie close to the fitted line.

ggplot(aqc, aes(Solar.R, Temp)) + geom_point()

We know from the regression output that the \(r^2\) statistic is about 0.09, which is low: Solar.R explains only about 9% of the variation in Temp. This is evident in the scatter plot, where the points are widely scattered and show no strong pattern.

ggplot(aqc, aes(Ozone, Temp)) + geom_point()

From this graph, we see a clear upward trend that looks reasonably strong, and the data points lie closer together. We can expect a much higher \(r^2\) statistic for this model than for the previous one.
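The expectation from the two scatter plots can be checked by pulling \(r^2\) out of each model. This is a sketch under the assumption that `aqc` is `airquality` with incomplete rows removed. The Ozone model is not fitted anywhere in the original, so `mod_ozone` is introduced here only for the comparison.

```r
# Assumption: aqc is the built-in airquality data with incomplete rows removed.
aqc <- na.omit(airquality)

mod_solar <- lm(Temp ~ Solar.R, data = aqc)
mod_ozone <- lm(Temp ~ Ozone, data = aqc)  # introduced here for comparison

# summary() stores the r-squared statistic in the r.squared element.
r2_solar <- summary(mod_solar)$r.squared
r2_ozone <- summary(mod_ozone)$r.squared

# In simple linear regression, r-squared equals the squared correlation
# between the independent and dependent variables.
r2_check <- cor(aqc$Solar.R, aqc$Temp)^2

print(c(solar = r2_solar, ozone = r2_ozone, solar_from_cor = r2_check))
```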

Strength of Models pt.2

  • Checking the significance of the independent variable is an important part of checking the validity of a model.
  • This is a hypothesis test of whether \(\beta_1\) is actually \(0\) (the null hypothesis is \(\beta_1 = 0\)).
  • The output of the regression model gives most of what we need.
  • The hypothesis can be checked like other tests, with the t-statistic, the p-value, or a confidence interval. The t-statistic and p-value are in the R summary output; the confidence interval is not, but it can be computed with confint().
  • If the null hypothesis \(\beta_1 = 0\) is not rejected, the model should not be trusted for prediction, because there is not enough evidence that \(y\) changes with \(x\). In the output above, the p-value for Solar.R is \(0.00173\), so the null hypothesis is rejected at the \(0.05\) level.
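The three ways of checking the hypothesis can be sketched as follows, assuming `aqc` is `airquality` with incomplete rows removed. The 0.05 significance level and the matching 95% confidence level are conventional choices, not something the original specifies.

```r
# Assumption: aqc is the built-in airquality data with incomplete rows removed.
aqc <- na.omit(airquality)

mod <- lm(Temp ~ Solar.R, data = aqc)

# The coefficient table holds the estimate, standard error, t value and p-value.
coefs <- summary(mod)$coefficients
t_value <- coefs["Solar.R", "t value"]
p_value <- coefs["Solar.R", "Pr(>|t|)"]

# The t-statistic is the estimate divided by its standard error.
t_manual <- coefs["Solar.R", "Estimate"] / coefs["Solar.R", "Std. Error"]

# The confidence interval is not printed by summary(); confint() computes it.
ci <- confint(mod, "Solar.R", level = 0.95)

# Reject the null hypothesis (slope = 0) if p < 0.05, or equivalently
# if the 95% confidence interval does not contain 0.
reject_by_p <- p_value < 0.05
reject_by_ci <- ci[1, 1] > 0 || ci[1, 2] < 0

print(c(t = t_value, p = p_value))
print(ci)
print(c(reject_by_p = reject_by_p, reject_by_ci = reject_by_ci))
```

The p-value rule and the confidence-interval rule lead to the same decision when the confidence level is one minus the significance level.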

Conclusion

  • In conclusion, regression models are useful for prediction and for identifying trends in data.
  • We now know how to interpret the coefficients of the model.
  • Finally, we can identify the key statistics that indicate how trustworthy the model is and how strong its predictive power is.