Hello and welcome to Pete’s R-guide on Simple Linear Regression!

I want to bring you into the glorious world of statistics and teach you what you need to know to be a successful statistician. As many statisticians around the world would say, linear regression is a topic that is fundamental to understanding data.

Simple Linear Regression

Simple Linear Regression is a type of model. It assumes that the relationship between a dependent variable and an independent variable can be approximated by a straight line. Testing this relationship is the aim of this chapter, so throughout this guide I will show you how to develop the linear relationship and the various ways to test the model we build.

I found a data set with information on the number of murders, assaults, and rapes by state (measured as arrests per 100,000 residents). R has a ton of different data sets available to practice on, and this was one I thought would be interesting.

data(USArrests)
## The data function loads the specified data set "USArrests"
attach(USArrests)
## The attach function adds our data set to R's search path, so we can refer to its columns by name

Okay, so, to begin our regression analysis, a first step is to think about what you would like to analyze. I tend to view the data first, and then consider which pairs of variables might be related.

head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
## head allows us to view the first few rows of our data in a table

A pair I think may be interesting to analyze is the relationship of Murders vs. Assaults, so I will generate a scatterplot of those two variables. I will also introduce a function that fits a linear regression between them.

plot(Assault, Murder)

## plots our two variables on an x-y graph, with Assault on the x-axis and Murder on the y-axis

Hmm, it looks like there might be something to analyze here. With this scatterplot, my independent (predictor) variable is the number of assaults, and the dependent (response) variable is the number of murders. Our plot shows two main things: 1. A tendency for the number of Murders to increase in a “straight-line” fashion as Assaults increase. 2. A scattering of points around that “straight line.”

We can begin to create our model now. Look back at our scatterplot, and think to yourself, what if we had a line that approximately fit our data points? This would be called our “Line of Best Fit.” To fit a linear model to our data, we use the “lm” function. The fitted model itself looks something like this: \[\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i\] The beta coefficients are the weights for each variable. In our case of simple regression, try to think of beta 1 as the “slope” of our relationship and beta 0 as the intercept.

mod<-lm(Murder~Assault)
## When we write "var_name<-x" that is saying var_name = x. We do this so we can reference data or a fitted relationship easily and quickly without needing to retype functions.
mod
## 
## Call:
## lm(formula = Murder ~ Assault)
## 
## Coefficients:
## (Intercept)      Assault  
##     0.63168      0.04191
## Calls the variable. This will show us our linear model and the coefficients for our line of best fit

Using the values from above, we can plot our line against our data.

plot(Assault, Murder)
abline(.631683, .041909)

## abline takes in the intercept and slope (in that order) and plots a line; abline(mod) would also work

It truly does seem that the points hover around this line. There is an interesting concept to note here: homoscedasticity vs. heteroscedasticity. Homoscedasticity means, roughly, that as x-values increase the variance of the y-values around the line stays the same, while heteroscedasticity means that variance changes (often growing) as x increases. Looking at our data with our line, notice that as the x-values increase, the y-values tend to sit at larger distances from our line, that is, they show more variance around it. This would lead me to classify this relationship as heteroscedastic.
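If you want a quick visual check of this (a small sketch on top of the model we just fit, not a formal test), plotting the residuals against the fitted values is a common approach; a roughly even band suggests homoscedasticity, while a widening funnel suggests heteroscedasticity.

plot(mod$fitted.values, mod$residuals, xlab="Fitted values", ylab="Residuals")
abline(h=0)
## points spreading further from the zero line as the fitted values grow is the "funnel" shape described above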

Model Testing

A model is not likely to be useful unless we can say there is a significant relationship between x & y. We introduce a null hypothesis to test the relationship. The null hypothesis states that the weight attached to our x variable is equal to 0, which would mean that any change in x has no effect on our dependent (y) variable. With a test such as this, the goal is to decide whether that weight is or isn’t 0, so our alternative hypothesis is that the weight is not equal to 0. Below is generally how such a test looks, although there are variations: \[{H_0}:{\beta_1}=0\] And our alternative hypothesis: \[{H_a}:{\beta_1}\neq 0\]

The function “summary” is able to compile many different testing statistics quickly and easily:

summary(mod)
## 
## Call:
## lm(formula = Murder ~ Assault)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8528 -1.7456 -0.3979  1.3044  7.9256 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.631683   0.854776   0.739    0.464    
## Assault     0.041909   0.004507   9.298  2.6e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.629 on 48 degrees of freedom
## Multiple R-squared:  0.643,  Adjusted R-squared:  0.6356 
## F-statistic: 86.45 on 1 and 48 DF,  p-value: 2.596e-12

Notice that it returns the residual measures, the coefficient estimates with their t-tests, the residual standard error, R-squared, and the F-statistic; we will use several of these below.
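If you want to pull individual pieces out of that output rather than read them off the printout, summary() returns them as named components (a small sketch; the numbers match the table above):

s <- summary(mod)
s$sigma         ## residual standard error, about 2.629
s$r.squared     ## multiple R-squared, about 0.643
s$coefficients  ## estimates, standard errors, t values and p-values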

Residual Testing

One check is to see whether the residuals are approximately normally distributed, which we can do by graphing a histogram of them.

resid<-mod$residuals
## store the model's residuals so we can examine them
hist(resid)
## draws a histogram of the residuals

It looks slightly skewed, but has a general bell-curve shape. We can also make a normal Q-Q plot, which compares the residuals’ quantiles to those of a normal distribution; qqline adds a straight reference line to compare against.

qqnorm(resid)
qqline(resid)

This appears generally sound, with some larger deviation from the line in the upper quantiles.

So we can continue our relationship testing.

Mean Squared Error

Another useful measure is the Mean Squared Error. It is the sum of the squared residuals divided by (n-2): \[{s^2}= SSE/(n-2)\] This is essentially the sum of squared residuals divided by our degrees of freedom (n observations minus the number of coefficients being estimated, two in our case). The square root of this measure is the residual standard error reported by the summary function above, and it tells us how much variation remains after we apply our model to the data set.
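To see where the residual standard error of 2.629 in the summary comes from, we can compute it by hand from the residuals (a quick check using the fitted model mod):

SSE <- sum(mod$residuals^2)  ## sum of squared residuals
n <- length(mod$residuals)   ## 50 observations, one per state
s2 <- SSE/(n-2)              ## mean squared error
sqrt(s2)                     ## matches the residual standard error, about 2.629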

F-Test

Another important test is an “F-Test.” This can be used to test the significance of the regression model, which in a simple model such as ours is another way to test our null hypothesis. Our F test statistic is calculated as: \[F = {Explained Variation}/({Unexplained Variation}/(n-2))\]
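Here the explained variation is the variation captured by the regression line and the unexplained variation is the sum of squared residuals. We can compute the statistic by hand and compare it to the 86.45 reported by summary (a sketch, reusing the fitted model mod):

SST <- sum((Murder - mean(Murder))^2)  ## total variation in y
SSE <- sum(mod$residuals^2)            ## unexplained variation
SSR <- SST - SSE                       ## explained variation
SSR/(SSE/(length(Murder)-2))           ## about 86.45, matching the summary output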

A simple way to use this is to define a level of significance (generally denoted as alpha) and check whether our F statistic is larger than the critical F value for that alpha (with 1 and n-2 degrees of freedom). In our case, if we test at a significance level of .1, our F statistic of 86.45 far exceeds the critical value, which means we can reject the null hypothesis.
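In R the critical value comes from qf(); a sketch of that comparison at alpha = .1, with 1 and 48 degrees of freedom, looks like this:

alpha <- 0.1
qf(1-alpha, df1=1, df2=48)    ## critical F value, roughly 2.8
86.45 > qf(1-alpha, 1, 48)    ## TRUE, so we reject the null hypothesis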

R-Squared Value

The last test statistic of importance is the R-squared value. R-squared becomes especially important with multiple linear regression, and is calculated as follows: \[{r^2}=({Explained Variation}/{Total Variation})\] It is a popular choice when assessing how well a model describes a data set. Its values are between 0 and 1, with the goal of being near 1, which would mean that your model explains nearly all of the total variance you see. This value is also reported by our summary function.
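We can verify the value reported by summary() directly from the definition (a small sketch; in simple regression this also equals the squared correlation between x and y):

SST <- sum((Murder - mean(Murder))^2)  ## total variation
SSE <- sum(mod$residuals^2)            ## unexplained variation
(SST - SSE)/SST                        ## explained/total, about 0.643
cor(Assault, Murder)^2                 ## same value via the squared correlation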

Prediction & Confidence Intervals

An important aspect of a model is its ability to predict future occurrences. We can’t predict the exact outcome, but what if we could be, say, 90% sure that for a certain x value our observation would fall between two bounds? This can be extremely helpful to companies and different business strategies, if they can be confident that they will see results within a certain range.

R is able to calculate this. Let’s see how:

Let’s look at a prediction interval with 95% confidence, at Assault = 200. This returns an interval within which we can say, with 95% confidence, the number of murders will fall.

# create a new data frame
newdata <- data.frame("Assault"=200)

# prediction interval when Assaults=200
(predy <- predict(mod, newdata, interval="prediction"))
##        fit      lwr      upr
## 1 9.013408 3.667549 14.35927

This means we’re 95% sure that there will be between roughly 3.7 and 14.4 murders for that number of assaults. You can adapt this to different levels of confidence, with the interval shrinking as the confidence level decreases; the default level is 95% with this syntax.
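For example, asking for a 90% prediction interval uses the same predict() call with the level argument; the fitted value stays the same while the bounds move closer together:

predict(mod, newdata, interval="prediction", level=0.90)
## same fit of about 9.01, with a narrower lwr/upr pair than the 95% interval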

A confidence interval, on the other hand, returns the interval within which we can be 95% confident the true mean response lies.
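The call is the same as before with the interval type swapped (a sketch; the bounds will be tighter than the prediction interval because they describe the mean rather than a single new observation):

## confidence interval for the mean number of murders when Assault = 200
predict(mod, newdata, interval="confidence")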

Summary

I have highlighted what I believe to be essential areas of Chapter 3, which was all about building a simple linear regression model, how to test its significance and accuracy, and how to use that model to predict future occurrences.

Thanks for reading! I hope you were able to learn something from this; I know writing it helped me become more familiar with the topic.