Introduction

Chapter 3 was all about Simple Linear Regression. Simple Linear Regression is the process of creating a model that relates the data using a single straight line. We use this line to estimate the linear relationship between two quantitative variables.

Topics that will be covered include the following: exploratory data analysis, least squares point estimates, model assumptions, testing the significance of slope and intercept, confidence intervals and prediction intervals, coefficients of determination and correlation, and F-tests.

To start, we are going to get some data loaded into R. To do this we will use the library() command to load a package called “Lock5withR”.

library(Lock5withR)
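
If R says the package is not found, it likely has not been installed yet. As a side note (not part of the original steps), a quick sketch that installs it from CRAN only when it is missing:

# One-time setup: install Lock5withR from CRAN if it is not already installed
if (!requireNamespace("Lock5withR", quietly = TRUE)) {
  install.packages("Lock5withR")
}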

We are going to be looking at a data set within Lock5withR called “SleepStudy”. We can attach it to R’s search path using the command attach(). By using attach(), we won’t need to call out the specific data set along with each variable; we can just use the variable name.

attach(SleepStudy)
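
As an aside, attach() can mask other objects that share variable names, so some R users avoid it. A minimal alternative sketch, referencing variables explicitly instead:

# Refer to variables explicitly with $ instead of attaching the data set
mean(SleepStudy$Drinks)
# Or evaluate a single expression inside the data set with with()
with(SleepStudy, mean(Drinks))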

You can type ?SleepStudy into the console to understand what this data set is.

?SleepStudy

We can see that this is “data from a study of sleep patterns for college students”.

Now that we have our data loaded and a basic understanding of it, we will want to do some basic analysis to get a feel for our data.

Exploratory Data Analysis

The command summary() can be used to get an overview of your data. It gives the minimum, quartiles, median, mean, and maximum for each numeric variable, and counts for each categorical variable. Let’s look at a summary of “SleepStudy”:

summary(SleepStudy)
##      Gender         ClassYear        LarkOwl    NumEarlyClass  
##  Min.   :0.0000   Min.   :1.000   Lark   : 41   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:2.000   Neither:163   1st Qu.:0.000  
##  Median :0.0000   Median :2.000   Owl    : 49   Median :2.000  
##  Mean   :0.4032   Mean   :2.478                 Mean   :1.735  
##  3rd Qu.:1.0000   3rd Qu.:3.000                 3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :4.000                 Max.   :5.000  
##    EarlyClass         GPA        ClassesMissed    CognitionZscore    
##  Min.   :0.000   Min.   :2.000   Min.   : 0.000   Min.   :-1.62e+00  
##  1st Qu.:0.000   1st Qu.:3.000   1st Qu.: 0.000   1st Qu.:-4.80e-01  
##  Median :1.000   Median :3.300   Median : 1.000   Median :-1.00e-02  
##  Mean   :0.664   Mean   :3.244   Mean   : 2.209   Mean   :-3.95e-05  
##  3rd Qu.:1.000   3rd Qu.:3.500   3rd Qu.: 3.000   3rd Qu.: 4.40e-01  
##  Max.   :1.000   Max.   :4.000   Max.   :20.000   Max.   : 1.96e+00  
##  PoorSleepQuality DepressionScore   AnxietyScore     StressScore    
##  Min.   : 1.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 3.000  
##  Median : 6.000   Median : 3.000   Median : 4.000   Median : 8.000  
##  Mean   : 6.257   Mean   : 5.202   Mean   : 5.372   Mean   : 9.466  
##  3rd Qu.: 8.000   3rd Qu.: 7.000   3rd Qu.: 8.000   3rd Qu.:14.000  
##  Max.   :18.000   Max.   :35.000   Max.   :26.000   Max.   :37.000  
##  DepressionStatus  AnxietyStatus    Stress       DASScore    
##  moderate: 34     moderate: 56   high  : 56   Min.   : 0.00  
##  normal  :209     normal  :181   normal:197   1st Qu.: 7.00  
##  severe  : 10     severe  : 16                Median :16.00  
##                                               Mean   :20.04  
##                                               3rd Qu.:28.00  
##                                               Max.   :82.00  
##    Happiness        AlcoholUse      Drinks         WeekdayBed   
##  Min.   : 0.00   Abstain : 34   Min.   : 0.000   Min.   :21.80  
##  1st Qu.:24.00   Heavy   : 16   1st Qu.: 3.000   1st Qu.:24.20  
##  Median :28.00   Light   : 83   Median : 5.000   Median :24.80  
##  Mean   :26.11   Moderate:120   Mean   : 5.569   Mean   :24.85  
##  3rd Qu.:30.00                  3rd Qu.: 8.000   3rd Qu.:25.50  
##  Max.   :35.00                  Max.   :24.000   Max.   :29.10  
##   WeekdayRise      WeekdaySleep      WeekendBed     WeekendRise   
##  Min.   : 5.500   Min.   : 3.000   Min.   :21.50   Min.   : 5.25  
##  1st Qu.: 8.000   1st Qu.: 7.200   1st Qu.:24.88   1st Qu.: 9.25  
##  Median : 8.500   Median : 7.950   Median :25.50   Median :10.25  
##  Mean   : 8.586   Mean   : 7.866   Mean   :25.58   Mean   :10.20  
##  3rd Qu.: 9.150   3rd Qu.: 8.600   3rd Qu.:26.25   3rd Qu.:11.00  
##  Max.   :12.020   Max.   :10.970   Max.   :30.25   Max.   :15.00  
##   WeekendSleep     AverageSleep      AllNighter         Sex     
##  Min.   : 4.000   Min.   : 4.950   Min.   :0.0000   Female:151  
##  1st Qu.: 7.250   1st Qu.: 7.430   1st Qu.:0.0000   Male  :102  
##  Median : 8.250   Median : 8.000   Median :0.0000               
##  Mean   : 8.217   Mean   : 7.966   Mean   :0.1344               
##  3rd Qu.: 9.250   3rd Qu.: 8.590   3rd Qu.:0.0000               
##  Max.   :12.750   Max.   :10.620   Max.   :1.0000               
##  allNighter earlyClass
##  No :219    No : 85   
##  Yes: 34    Yes:168   
##                       
##                       
##                       
## 

SleepStudy has a lot of data, but we are interested in one question: “Does the number of alcoholic drinks per week a student has affect their GPA?” Looking at our summary, we see that students drink an average of 5.57 drinks per week and have an average GPA of 3.24. To go further, let’s plot these two variables.
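
If you want to confirm those two averages without scanning the full summary, a quick sketch:

# Compute the two averages quoted above directly
mean(Drinks)  # about 5.57 drinks per week
mean(GPA)     # about 3.24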

To start, let’s plot the number of alcoholic drinks per week against the student’s GPA. We will make GPA the response variable and number of alcoholic drinks per week the explanatory variable; this is because we can hypothesize that if a student changes the number of alcoholic drinks per week they consume, their GPA will change.

We can plot number of alcoholic drinks per week against GPA using the plot() command. The arguments will be the explanatory variable, the response variable, the x-axis label (xlab), the y-axis label (ylab), and the main title (main).

plot(Drinks, GPA, xlab = "Alcoholic Drinks per Week", ylab = "GPA (0-4)", main = "Alcoholic Drinks vs GPA")

When looking at this plot we see that there might be a slight downward trend; we can form a preliminary hypothesis: “The more alcoholic drinks a student has in a week, the lower their GPA tends to be.” To test this, let’s try to fit a linear model and see if the prediction is right.

Least Squares

Prediction Equation

To see if there is a linear relationship between the number of Alcoholic Drinks per Week a student consumes and their GPA, we will need to create a linear model, or least squares prediction equation, \(\hat{y}=b_0+b_1x\). Our least squares prediction equation for this model is \(\widehat{GPA}=b_0+b_1*AlcoholicDrinksPerWeek\).

We can create a linear model by using the lm() command. The main argument is a formula of the form response ~ explanatory. Let’s call the model “drinkmod” since we are looking at alcoholic drinks.

drinkmod <- lm(GPA ~ Drinks)
drinkmod
## 
## Call:
## lm(formula = GPA ~ Drinks)
## 
## Coefficients:
## (Intercept)       Drinks  
##     3.39186     -0.02659

Our output tells us which model we are calling (in this case GPA vs Alcoholic Drinks), and our slope and intercept. Our slope is under the word Drinks, since this is the slope for our predictor variable. Putting these values into our prediction equation we get: \[\widehat{GPA}=3.392+(-0.027)*AlcoholicDrinksPerWeek\]

This tells us that for every extra alcoholic drink per week a student consumes, their GPA will decrease by about 0.027 points, on average.
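
As a quick visual check (not shown in the original output), we can redraw the scatterplot and overlay the fitted line; abline() accepts an lm object and draws its fitted line:

# Redraw the scatterplot and overlay the least squares line
plot(Drinks, GPA, xlab = "Alcoholic Drinks per Week", ylab = "GPA (0-4)", main = "Alcoholic Drinks vs GPA")
abline(drinkmod)  # draws the line GPA = 3.392 - 0.027 * Drinks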

Once we have our linear model, we want to make sure that the assumptions of simple linear regression hold.

Assumptions

There are 3 model assumptions that we have to check before we can make statements about the relationship between Alcoholic Drinks per Week and GPA. They are: the potential error term must be normally distributed with mean 0, the potential error term’s variance must not depend on x, and the error terms must be statistically independent, or random.

Assumption 1: Normally distributed error term

To check whether our residuals are normally distributed, we can make a normal Q-Q plot using the commands qqnorm() and qqline(). The command qqnorm() plots the quantiles of our residuals against the quantiles of a normal distribution. The command qqline() adds a reference line showing where the points would fall if the residuals were exactly normal; it is easier to judge points against a line than against a histogram.

Once the plot is drawn, we compare the points to the line. If the points fall along the line, the residuals are approximately normally distributed.

qqnorm(drinkmod$residuals)
qqline(drinkmod$residuals)

We see in the Normal Q-Q Plot that our data is mostly on the line, except at either tail. This means that most of our data is normally distributed, or it is nearly normal. If you were presenting this data in a highly technical setting, you would want to revisit whether or not it is normal. For the purpose of this guide, to show Simple Linear Regression, we will say that our data is normal enough.
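
If you want a more formal check to go with the Q-Q plot, one option (beyond what this guide requires) is a Shapiro-Wilk normality test on the residuals:

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
shapiro.test(drinkmod$residuals)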

Error Term with Mean 0

This is a simple assumption to test. Merely use the command mean() to compute the mean of our residuals.

mean(drinkmod$residuals)
## [1] 1.957435e-17

We see that the mean of our residuals, or potential error, is effectively 0 (the tiny nonzero value is just floating-point rounding). This means our assumption holds.
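
To be thorough, we can confirm that this tiny number is zero up to floating-point tolerance; a small sketch:

# TRUE if the residual mean is zero within numerical tolerance
isTRUE(all.equal(mean(drinkmod$residuals), 0))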

Assumption 2: Potential Error Term Variance

Our second assumption is that the variance of the potential error term is not related to x. Checking this is called a test of homoscedasticity. To test for this, we will plot our model residuals against the model fitted values. We will use the plot() command as before, with arguments of our residuals and fitted values. I also like to plot a horizontal line at zero just for visual reference when looking for constant variance.

plot(drinkmod$residuals ~ drinkmod$fitted.values, main = "Drinkmod Residuals vs Drinkmod Fitted Values", xlab="Fitted Values", ylab = "Residuals")
abline(0,0)

We can see that for fitted values 3.0 and higher, the variance looks roughly constant. However, there is a little evidence of heteroscedasticity because of the outliers on the left side of the graph. This is further evidence that our linear model is not a good fit for our data. However, for illustration purposes, we will continue on.
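
As an aside, R’s built-in plot() method for lm objects produces a related diagnostic, the scale-location plot, which shows the same constant-variance information in a standardized form:

# Scale-location diagnostic plot (the third of plot.lm's standard plots)
plot(drinkmod, which = 3)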

Assumption 3: Error Terms Random

Our final assumption is that our error terms are random. To test this we can look at the same graph we made above. To minimize scrolling, we will plot it again below. There is no new code here; it is simply copied from above.

plot(drinkmod$residuals ~ drinkmod$fitted.values, main = "Drinkmod Residuals vs Drinkmod Fitted Values", xlab="Fitted Values", ylab = "Residuals")
abline(0,0)

We can see here that our points are not totally random. The points are clustered to the right side of our graph. This means our final assumption does not hold.

Since our first two assumptions barely held and the third one definitely does not hold, we can say there is not a linear relationship between the number of Alcoholic Drinks per Week and GPA. There are a few more points to make about simple linear regression, though, so we will continue with Alcoholic Drinks per Week and GPA for demonstration: we will still get numbers, and we will discuss what they would mean if the model were valid.

Point Estimates

A point estimate is merely choosing a value for the predictor and seeing what response your model gives. In practice, let’s say I consume 4 alcoholic drinks per week. We want to use our linear model to predict what my GPA is.

We can do this by simply plugging our desired predictor value, in this case 4 drinks, into the linear model equation:

3.39186+(-0.02659)*4
## [1] 3.2855

This means that for someone who drinks 4 alcoholic drinks per week, we expect their GPA to be about 3.29, on average.
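
Rather than typing the coefficients by hand, we can get the same point estimate with the predict() command, which we will also use later for intervals:

# Same point estimate via predict(); data.frame() supplies the new x value
predict(drinkmod, newdata = data.frame(Drinks = 4))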

Testing Significance

Testing the significance of our \({\beta}\) coefficients is important. The null hypothesis is that \({\beta}=0\). So, if the p-value for \({\beta}\) is low enough, we have enough evidence to say that \({\beta}\) is not equal to zero.

T-Test for Slope

To test the significance of our slope, \({\beta_1}\), we use a t-test. The null hypothesis of this t-test is that \({\beta_1}= 0\) and the alternative is \({\beta_1}\neq 0\).

If the p-value of the regression coefficient of Drinks is low enough, we have enough evidence to say that the slope of our line is not zero, and there is some sort of relationship between the number of Alcoholic Drinks per Week and GPA. To get our p-value, we will use the command summary() with our model name as the argument.

The summary() command gives us a lot of different information, which we discuss later in this R guide, but for now let’s get the p-value for the t-test for the slope.

summary(drinkmod)
## 
## Call:
## lm(formula = GPA ~ Drinks)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39186 -0.25893  0.04107  0.26131  0.84742 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.391860   0.041455   81.82  < 2e-16 ***
## Drinks      -0.026587   0.006001   -4.43 1.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3901 on 251 degrees of freedom
## Multiple R-squared:  0.07252,    Adjusted R-squared:  0.06883 
## F-statistic: 19.63 on 1 and 251 DF,  p-value: 1.407e-05

You can find the p-value for this t-test in the row labeled “Drinks” (about halfway down) and the column labeled “Pr(>|t|)”. You can see our p-value is \(1.41*10^{-5}\). This means we reject the null that \({\beta_1}= 0\) in favor of the alternative that it doesn’t equal zero. In other words, the slope of our linear model is not equal to zero and there is a linear relationship between the number of alcoholic drinks per week and GPA.
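
If you want the p-value as a number instead of reading it off the printout, it can be pulled straight from the coefficient table:

# Extract the slope's p-value from the summary's coefficient matrix
summary(drinkmod)$coefficients["Drinks", "Pr(>|t|)"]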

T-Test for the Intercept

Testing \({\beta_0}\), or the intercept, is not always useful (in the case of heights, a height of 0 doesn’t make sense), but in our case testing \({\beta_0}\) makes sense because a student can drink 0 alcoholic drinks per week.

Testing \({\beta_0}\) is the exact same process as testing \({\beta_1}\). We will use the summary() command again, this time looking at the p-value for our intercept.

summary(drinkmod)
## 
## Call:
## lm(formula = GPA ~ Drinks)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39186 -0.25893  0.04107  0.26131  0.84742 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.391860   0.041455   81.82  < 2e-16 ***
## Drinks      -0.026587   0.006001   -4.43 1.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3901 on 251 degrees of freedom
## Multiple R-squared:  0.07252,    Adjusted R-squared:  0.06883 
## F-statistic: 19.63 on 1 and 251 DF,  p-value: 1.407e-05

Looking at the p-value for our intercept, we see that it is less than \(2*10^{-16}\). This means that we have enough evidence to reject the null hypothesis of \({\beta_0} = 0\) and say that our intercept is not equal to 0. To interpret: in our model, a person who drinks 0 alcoholic drinks per week will not have a GPA of 0.0, or all F’s.

F-Tests

Our last test is an F-test. The F-test is used to test the significance of our linear regression model. Our null hypothesis is that \(\beta_1=0\) and the alternative is that \(\beta_1 \neq 0\). A small p-value will mean that \(\beta_1 \neq 0\) and our simple linear regression model is significant.

We find the F-test statistic using the summary() command again.

summary(drinkmod)
## 
## Call:
## lm(formula = GPA ~ Drinks)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39186 -0.25893  0.04107  0.26131  0.84742 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.391860   0.041455   81.82  < 2e-16 ***
## Drinks      -0.026587   0.006001   -4.43 1.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3901 on 251 degrees of freedom
## Multiple R-squared:  0.07252,    Adjusted R-squared:  0.06883 
## F-statistic: 19.63 on 1 and 251 DF,  p-value: 1.407e-05

The F-test p-value can be found on the very last line, all the way on the right. Our p-value is \(1.407*10^{-5}\). This means that we can reject the null and say that our simple linear regression model is significant. In other words, there is a linear relationship between the number of alcoholic drinks per week a student drinks and their GPA.
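
For reference, the F statistic and its degrees of freedom can also be extracted from the summary object and the p-value recomputed by hand:

# Recompute the F-test p-value from the statistic and degrees of freedom
fstat <- summary(drinkmod)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)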

Intervals

Another important set of information you might want is prediction and confidence intervals. A confidence interval tells us about the average response, whereas a prediction interval tells us about the range an individual value can be in.

Confidence Intervals for Regression Coefficients

A confidence interval for a coefficient will tell us something along the lines of, “We are 95% confident that if we increase the number of alcoholic drinks per week a student consumes by 1 drink, the average GPA will change by between _ and _ points”. In equation form, the confidence interval for the true slope is: \[[b_1 \pm t^{n-2}_{\alpha/2}*s_{b_1}]\]

Clearly this formula would be tedious to calculate by hand, so we can use the command confint(), which finds a confidence interval for a parameter of a model. We will use arguments of our model name (drinkmod), the parameter we want the confidence interval for ('Drinks'), and the level we want (in this case, 95%). Let’s see what the output looks like.

confint(drinkmod, 'Drinks', level = 0.95)
##              2.5 %      97.5 %
## Drinks -0.03840572 -0.01476766

This output gives us the lower bound and the upper bound of our 95% confidence interval. Our confidence interval is [-0.0384, -0.0148]. In this case, we are 95% confident that if we increase the number of alcoholic drinks per week by 1 drink, the average GPA will decrease by between 0.0148 and 0.0384 points.
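
As a side note, calling confint() without naming a parameter returns intervals for every coefficient at once:

# 95% confidence intervals for both the intercept and the slope
confint(drinkmod, level = 0.95)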

Sometimes we are curious about a specific value, such as 8 alcoholic drinks per week and what that means for GPA; to do this we can use confidence intervals for mean values and prediction intervals.

Confidence Intervals for Mean Value

A confidence interval for a mean value will tell us something along the lines of, “We are 95% confident that the mean GPA for a student who drinks __ alcoholic drinks per week is between these two numbers”. To illustrate this, we can put the number of alcoholic drinks we are interested in into R. Let’s say we want to know about students who drink 8 alcoholic drinks per week. We need to make a data frame that sets the number of drinks to 8. This can be accomplished using the command data.frame() with the variable name set equal to the number of alcoholic drinks per week we are interested in.

drinkdata <- data.frame(Drinks = 8)

Once we have the data frame made, we can use the predict() command to get our confidence interval. The arguments are: the model, the data frame, interval = “confidence”, and level = .95. The interval argument tells R we want a confidence interval, and level sets the confidence level at 95%. Let’s see the output:

confidenceInterval <- predict(drinkmod, drinkdata, interval = "confidence", level = .95)
confidenceInterval
##        fit      lwr     upr
## 1 3.179167 3.122964 3.23537

We can see our confidence interval is [3.123, 3.235]. This means that we are 95% confident that the average GPA of a student who drinks 8 alcoholic drinks per week is between 3.123 and 3.235.

We also might be interested in the range of GPAs for an individual student who has 8 alcoholic drinks per week. To find this out, we use a prediction interval.

Prediction Intervals

To look at a prediction interval for a student who drinks 8 alcoholic drinks per week, we will use the same predict() command as above. The only difference is that our argument will be interval = “predict”. Let’s check out the output.

predictionInterval <- predict(drinkmod, drinkdata, interval = "predict", level = .95)
predictionInterval
##        fit      lwr      upr
## 1 3.179167 2.408782 3.949551

Our prediction interval is [2.409, 3.950]. This means that we are 95% confident that a student who drinks 8 alcoholic drinks per week will have a GPA between 2.409 and 3.950. This prediction interval is wider than the confidence interval because the confidence interval deals only with the average. The prediction interval has to cover a larger range of numbers to account for individual students who might drink 8 alcoholic drinks per week and still get a very high GPA.
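
To make the width difference concrete, we can subtract the bounds of each interval:

# Compare interval widths at Drinks = 8
confidenceInterval[, "upr"] - confidenceInterval[, "lwr"]  # narrower
predictionInterval[, "upr"] - predictionInterval[, "lwr"]  # wider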

Prediction and confidence intervals can be very useful when you want information about a specific value of your predictor.

Coefficient of Determination and Correlation

The coefficient of determination is a measure of the usefulness of a model, while the correlation coefficient measures the relationship between response and predictor (in our case, GPA and Alcoholic Drinks per Week). The coefficient of determination we use is called \(r^2\) and the correlation coefficient is called \(r\).

R Squared

The \(r^2\) value is the proportion of total variation in the observations that is explained by the regression model. Ideally, we want \(r^2\) to be close to 1, but \(r^2\) can be anywhere between 0 and 1. We can find \(r^2\) by using the summary() command. We used this command earlier to find the significance of our slope and intercept.

summary(drinkmod)
## 
## Call:
## lm(formula = GPA ~ Drinks)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39186 -0.25893  0.04107  0.26131  0.84742 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.391860   0.041455   81.82  < 2e-16 ***
## Drinks      -0.026587   0.006001   -4.43 1.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3901 on 251 degrees of freedom
## Multiple R-squared:  0.07252,    Adjusted R-squared:  0.06883 
## F-statistic: 19.63 on 1 and 251 DF,  p-value: 1.407e-05

The \(r^2\) value is found on the second to last line. It is called Multiple R-squared here because you can use the summary() command to analyze multiple linear regression as well. We can see our r-squared in this model is 0.07252. This means that 7.252% of the variation in GPA is explained by our linear model. We would expect an \(r^2\) this low because, as we saw in the first few sections, this model doesn’t meet our assumptions anyway.
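
If you need \(r^2\) as a number rather than reading it off the printout, it is stored in the summary object:

# Extract r-squared directly from the model summary
summary(drinkmod)$r.squared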

R

\(r\) is our correlation coefficient. Since \(r^2\) is between 0 and 1, \(r\) is always between -1 and 1; its magnitude is \(\sqrt{r^2}\) and its sign matches the sign of the slope. An \(r\) value close to -1 or 1 means there is a strong correlation, while an \(r\) close to 0 implies there is little relationship between response and predictor.

Since \(r\) isn’t included in the summary() output, the easiest way to find it is to use the command cor(). This will give us the correlation coefficient, \(r\), of our two variables. The two arguments are the predictor and the response in our regression model: Drinks and GPA.

cor(Drinks, GPA)
## [1] -0.2693046

We can see that our \(r\) value is -0.269. Since our \(r\) is negative, our data has a negative correlation: the more alcoholic drinks per week someone consumes, the lower their GPA tends to be. Since our \(r\) is fairly close to 0, though, we can say there is not a strong correlation between the number of alcoholic drinks per week a student consumes and their GPA. We would expect this low correlation since we found above that there is not a linear relationship between the two variables.
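
As a quick consistency check, squaring \(r\) should reproduce the \(r^2\) from the model summary, since this is a simple linear regression:

# r squared equals Multiple R-squared for simple linear regression
cor(Drinks, GPA)^2           # about 0.07252
summary(drinkmod)$r.squared  # matches the summary output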

Conclusion

We covered exploratory data analysis, least squares, testing significance, intervals, coefficients of determination and correlation, and F-tests. Our model tested whether there is a linear relationship between the number of alcoholic drinks per week a student consumes and their GPA.

In general, one should get their data, plot it, decide to use a simple linear regression model, create the model, check all of the assumptions about the model, and then run some tests to make sure that the model is a good fit.

The Simple Linear Regression model for the number of alcoholic drinks per week a student consumes and their GPA is not significant. While we can calculate numbers for all of our tests, the model assumptions do not hold, so the numbers cannot be trusted. In this R guide we continued to find the numbers for all of the tests and interpretations because the process is the same whether the model is valid or not. However, it is important to keep in mind that if your assumptions do not hold, you should not continue doing analysis on a linear model, except for illustration purposes as we have done here.