Introduction

We use a simple linear regression model to examine the relationship between two quantitative variables: a predictor and a response. The model lets us assess whether there is a linear relationship between the two variables. To demonstrate, we use the UScereal data set from the MASS library and look at the amount of fat and the number of calories contained in different types of cereal. We attach UScereal so that the data set does not need to be specified each time we use the data.

library(MASS)
attach(UScereal)

Before we create our regression model, there are a few takeaways we can get from the data itself. First, we look at a summary of the data set to better understand the data we are working with.

summary(UScereal)
##  mfr       calories        protein             fat            sodium     
##  G:22   Min.   : 50.0   Min.   : 0.7519   Min.   :0.000   Min.   :  0.0  
##  K:21   1st Qu.:110.0   1st Qu.: 2.0000   1st Qu.:0.000   1st Qu.:180.0  
##  N: 3   Median :134.3   Median : 3.0000   Median :1.000   Median :232.0  
##  P: 9   Mean   :149.4   Mean   : 3.6837   Mean   :1.423   Mean   :237.8  
##  Q: 5   3rd Qu.:179.1   3rd Qu.: 4.4776   3rd Qu.:2.000   3rd Qu.:290.0  
##  R: 5   Max.   :440.0   Max.   :12.1212   Max.   :9.091   Max.   :787.9  
##      fibre            carbo           sugars          shelf      
##  Min.   : 0.000   Min.   :10.53   Min.   : 0.00   Min.   :1.000  
##  1st Qu.: 0.000   1st Qu.:15.00   1st Qu.: 4.00   1st Qu.:1.000  
##  Median : 2.000   Median :18.67   Median :12.00   Median :2.000  
##  Mean   : 3.871   Mean   :19.97   Mean   :10.05   Mean   :2.169  
##  3rd Qu.: 4.478   3rd Qu.:22.39   3rd Qu.:14.00   3rd Qu.:3.000  
##  Max.   :30.303   Max.   :68.00   Max.   :20.90   Max.   :3.000  
##    potassium          vitamins 
##  Min.   : 15.00   100%    : 5  
##  1st Qu.: 45.00   enriched:57  
##  Median : 96.59   none    : 3  
##  Mean   :159.12                
##  3rd Qu.:220.00                
##  Max.   :969.70

We could use many of the variables in this data set as predictors for the number of calories in a cereal, but for this analysis, we will look solely at fat content to attempt to predict calories.

We can plot the grams of fat in each cereal against the number of calories to get an initial understanding of what to expect from the model.

plot(fat, calories, ylab = "Number of Calories", xlab = "Grams of Fat")

The data look fairly linear in the plot, so it would not be surprising to find a linear relationship between the number of calories and the grams of fat in cereal. There is a positive association between grams of fat and number of calories: as fat content increases, so does the number of calories. Most points follow the trend closely, with one very extreme outlier and a few less extreme ones.
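To see which cereal sits farthest from the rest, one option is to pull out the row with the largest calorie count. This is only a sketch: it assumes the extreme point in the plot is the cereal with the most calories, and the name `extreme` is illustrative rather than part of the original analysis.

```r
library(MASS)  # provides the UScereal data set

# Row with the largest calorie count, keeping only the two variables used here
extreme <- UScereal[which.max(UScereal$calories), c("calories", "fat")]
print(extreme)
```

The row name printed alongside the values identifies the cereal.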

Simple Linear Model

Next, we create our simple linear model:

(mod <- lm(calories ~ fat))
## 
## Call:
## lm(formula = calories ~ fat)
## 
## Coefficients:
## (Intercept)          fat  
##      117.60        22.36

From this linear model, the regression equation is \[\hat{calories} = 117.60+22.36\times fat.\] The slope tells us that for each additional gram of fat, the predicted number of calories increases by \(22.36\). The data summary shows that the minimum and the first quartile of fat are both \(0\), so fat-free cereals are well represented in the data and the intercept is interpretable: the model predicts about \(117.60\) calories for a cereal with \(0\) grams of fat.
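As a quick check on the slope interpretation, the sketch below refits the model with an explicit `data` argument (rather than relying on `attach`) and compares the predictions for two hypothetical cereals that differ by exactly one gram of fat. The names `fit`, `two_cereals`, and `preds` are illustrative and not part of the original analysis.

```r
library(MASS)  # provides the UScereal data set

# Refit the same model, passing the data frame explicitly
fit <- lm(calories ~ fat, data = UScereal)

# Two hypothetical cereals that differ by exactly one gram of fat
two_cereals <- data.frame(fat = c(1, 2))
preds <- predict(fit, newdata = two_cereals)

# The gap between the two predictions equals the estimated slope
slope_from_preds <- unname(preds[2] - preds[1])
slope_from_fit <- unname(coef(fit)["fat"])
print(c(slope_from_preds, slope_from_fit))
```

Any pair of fat values one gram apart would give the same difference, because the fitted relationship is a straight line.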

We can plot the regression line with the original data to better see whether or not there is a linear trend in the data.

plot(fat, calories, ylab = "Number of Calories", xlab = "Grams of Fat")
abline(mod)  # draw the fitted line directly from the model object

The regression line follows the bulk of the data fairly closely, so a linear model appears reasonable for these data.

Checking Assumptions with Residuals

Given the regression equation that we found, we can predict the number of calories in a cereal based on how much fat it contains. For example, we can look at the predicted number of calories corresponding to the first data point’s fat content:

(fat_actual <- UScereal[1,4])
## [1] 3.030303
(calories_pred <- coef(mod)%*%c(1, fat_actual))
##          [,1]
## [1,] 185.3595

So, based on our model, we predict that a cereal with \(3.030303\) grams of fat will have about \(185.3595\) calories. The actual number of calories that this cereal has is:

(calories_actual <- UScereal[1,2])
## [1] 212.1212

Then, the residual is

(resid <- calories_actual - calories_pred)
##          [,1]
## [1,] 26.76169

The residual tells us how far the actual value is from our prediction. So, for this cereal with \(3.030303\) grams of fat, our predicted number of calories is \(26.76169\) lower than the actual value.
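The same fitted value and residual can also be read directly from the model object. The sketch below assumes the model is refit with an explicit `data` argument; `fit`, `first_fitted`, `first_resid`, and `by_hand` are illustrative names.

```r
library(MASS)  # provides the UScereal data set

fit <- lm(calories ~ fat, data = UScereal)

# Fitted value and residual for the first cereal, taken from the model object
first_fitted <- unname(fitted(fit)[1])
first_resid <- unname(residuals(fit)[1])

# The same residual computed by hand: observed minus fitted
by_hand <- UScereal$calories[1] - first_fitted
print(c(fitted = first_fitted, residual = first_resid, by_hand = by_hand))
```

Using `fitted()` and `residuals()` avoids indexing columns by position and returns the values for every cereal at once.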

We can calculate the residuals for every data point we have and then plot them in a histogram to judge whether the errors are approximately normal.

myresids <- mod$residuals
hist(myresids, xlab = "Residuals", main = "")

The residuals do not look very normal, yet normality of the errors is an assumption we make when using the model. Thus, the inferences we draw about UScereal from the model may not be completely accurate. To better judge whether the residuals are approximately normal, we can plot the quantiles of the residuals against the quantiles of a normal distribution to see how well they line up.

qqnorm(myresids)
qqline(myresids)

This plot also shows that the errors may not be as close to a normal distribution as we might like, so we will need to be careful in using any analysis that comes from this model.
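To make the quantile comparison concrete, the sketch below builds the same kind of plot by hand: the sorted residuals are paired with standard normal quantiles evaluated at the plotting positions given by `ppoints()`. The names `fit`, `res`, `theoretical`, and `observed` are illustrative, and the model is refit with an explicit `data` argument.

```r
library(MASS)  # provides the UScereal data set

fit <- lm(calories ~ fat, data = UScereal)
res <- residuals(fit)
n <- length(res)

# Theoretical standard normal quantiles at the plotting positions ppoints(n)
theoretical <- qnorm(ppoints(n))

# Sample quantiles: the residuals sorted from smallest to largest
observed <- sort(unname(res))

plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
```

If the residuals were close to normal, the points would fall near a straight line; points that bend away from the line in the tails indicate departures from normality.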

Lastly, we can check our assumption of homoscedasticity. To see if the residuals are evenly spread out, we plot the residuals against the fat content of the cereal.

plot(mod$residuals ~ fat, ylab = "Residuals", xlab = "Grams of Fat")
abline(0,0)

From this plot, it appears that the residuals are fairly evenly spread out, so the assumption of constant variance appears to be met.

Data Analysis

We can use our simple linear model to look at relationships within the data.

T-test

With the model we created, we can check whether fat and calories in cereal are related using a hypothesis test where \(H_0\) is that there is no linear relationship, so \(\beta_1 =0\), and \(H_A\) is that there is a linear relationship, so \(\beta_1\neq 0\). We find the results of the hypothesis test in the summary of the model:

summary(mod)
## 
## Call:
## lm(formula = calories ~ fat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67.60 -29.96  -7.04  16.73 322.40 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  117.599      8.350  14.084  < 2e-16 ***
## fat           22.361      3.854   5.803 2.29e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.78 on 63 degrees of freedom
## Multiple R-squared:  0.3483, Adjusted R-squared:  0.338 
## F-statistic: 33.67 on 1 and 63 DF,  p-value: 2.292e-07

The test statistic is \(5.803\), which corresponds to a p-value of \(2.29\times 10^{-7}\), since under \(H_0\) the test statistic follows a t distribution with \(63\) degrees of freedom. Our p-value is very small, so we have enough evidence to reject the null hypothesis in favor of the alternative hypothesis. Our data therefore suggest that there is a relationship between the grams of fat in a cereal and the number of calories.
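The test statistic and p-value reported by `summary(mod)` can be reproduced by hand, which makes clear where they come from: the statistic is the slope estimate divided by its standard error, and the p-value is a two-sided tail area of a t distribution with the residual degrees of freedom. This is a minimal sketch; `fit`, `coefs`, `t_val`, and `p_val` are illustrative names.

```r
library(MASS)  # provides the UScereal data set

fit <- lm(calories ~ fat, data = UScereal)

# Pull the slope's estimate and standard error from the coefficient table
coefs <- summary(fit)$coefficients
est <- coefs["fat", "Estimate"]
se <- coefs["fat", "Std. Error"]

# Test statistic for H0: beta_1 = 0
t_val <- est / se

# Two-sided p-value from a t distribution with n - 2 residual degrees of freedom
df_resid <- df.residual(fit)
p_val <- 2 * pt(abs(t_val), df = df_resid, lower.tail = FALSE)
print(c(t = t_val, df = df_resid, p = p_val))
```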

F-test

We can reach the same conclusion with an F-test for a relationship between fat and calories. The same null and alternative hypotheses apply for this test as for the previous one. The model summary tells us that the F-statistic is \(33.67\), and it follows an F distribution with degrees of freedom 1 and 63. Using this distribution, we get the same p-value of \(2.292\times 10^{-7}\). Again, since the p-value is very small, we conclude that there is sufficient evidence that fat content and number of calories in cereal are related.
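The two tests agree because, in a simple linear regression with one predictor, the F-statistic is the square of the slope's t-statistic. The sketch below checks this on the fitted model; `fit`, `s`, `t_val`, `f_val`, and `p_f` are illustrative names.

```r
library(MASS)  # provides the UScereal data set

fit <- lm(calories ~ fat, data = UScereal)
s <- summary(fit)

t_val <- s$coefficients["fat", "t value"]
f_val <- unname(s$fstatistic["value"])

# With a single predictor, the F statistic is the squared t statistic of the slope
print(c(t_squared = t_val^2, F = f_val))

# p-value from the F distribution with 1 and n - 2 degrees of freedom
p_f <- pf(f_val,
          df1 = unname(s$fstatistic["numdf"]),
          df2 = unname(s$fstatistic["dendf"]),
          lower.tail = FALSE)
print(p_f)
```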

Confidence Intervals

In addition to performing hypothesis tests on \(\beta_1\), we can also create confidence intervals to estimate \(\beta_0\) and \(\beta_1\).

confint(mod)
##                 2.5 %    97.5 %
## (Intercept) 100.91250 134.28519
## fat          14.66031  30.06174

These confidence intervals tell us that we are \(95\)% confident that the true value of \(\beta_0\) is between \(100.91250\) and \(134.28519\) and that the true value of \(\beta_1\) is between \(14.66031\) and \(30.06174\). Here, \(95\)% confidence describes the procedure: if we drew many samples and built an interval from each in the same way, about \(95\)% of those intervals would contain the true \(\beta_0\), and likewise for \(\beta_1\).
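Each interval is the estimate plus or minus a t critical value times the standard error. The sketch below rebuilds the intervals by hand under that formula; `fit`, `coefs`, `t_crit`, and `by_hand` are illustrative names.

```r
library(MASS)  # provides the UScereal data set

fit <- lm(calories ~ fat, data = UScereal)

coefs <- summary(fit)$coefficients
t_crit <- qt(0.975, df = df.residual(fit))  # critical value for a 95% interval

# Estimate plus or minus critical value times standard error, for each coefficient
lower <- coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"]
upper <- coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"]
by_hand <- cbind(lower, upper)
print(by_hand)
```

A different confidence level only changes the quantile passed to `qt()`, for example `0.995` for a 99% interval.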

Next we look at the specific amount of fat contained in a cereal and the corresponding number of calories. Let’s consider cereals with \(1\) gram of fat.

newdata <- data.frame(fat = 1)

Then, we can create a confidence interval for the mean number of calories in a cereal when the number of grams of fat in the cereal is \(1\):

predict(mod, newdata, interval = "confidence")
##        fit      lwr      upr
## 1 139.9599 126.9591 152.9606

We are \(95\)% confident that the mean number of calories in a cereal with one gram of fat is in the interval \((126.9591, 152.9606)\).

Similarly, we can make a prediction interval to look at the number of calories a cereal with one gram of fat has.

predict(mod, newdata, interval = "predict")
##        fit      lwr      upr
## 1 139.9599 37.65109 242.2687

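This prediction interval tells us that we are \(95\)% confident that an individual cereal with one gram of fat has between \(37.65109\) and \(242.2687\) calories. It is much wider than the confidence interval for the mean because it accounts for the variability of individual cereals around the mean, in addition to the uncertainty in the estimated mean.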
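The difference between the two intervals comes down to one extra term in the standard error. The sketch below computes both by hand using the standard simple linear regression formulas; all variable names (`fit`, `x0`, `se_mean`, `se_pred`, `ci`, `pi_new`) are illustrative.

```r
library(MASS)  # provides the UScereal data set

fit <- lm(calories ~ fat, data = UScereal)

x <- UScereal$fat
n <- length(x)
x0 <- 1                          # fat content of interest, in grams
s <- summary(fit)$sigma          # residual standard error
sxx <- sum((x - mean(x))^2)      # spread of the predictor around its mean
t_crit <- qt(0.975, df = n - 2)
y_hat <- unname(predict(fit, data.frame(fat = x0)))

# Standard error for the mean response vs. for a single new observation:
# the prediction version adds 1 under the square root for individual variability
se_mean <- s * sqrt(1 / n + (x0 - mean(x))^2 / sxx)
se_pred <- s * sqrt(1 + 1 / n + (x0 - mean(x))^2 / sxx)

ci <- y_hat + c(-1, 1) * t_crit * se_mean
pi_new <- y_hat + c(-1, 1) * t_crit * se_pred
print(rbind(confidence = ci, prediction = pi_new))
```

Because the extra term does not shrink as the sample grows, a prediction interval stays wide even with a large data set, while the confidence interval for the mean narrows.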

Variability

Lastly, we look at variability of the number of calories in cereal. Total variation is the sum of explained variation and unexplained variation. \(r^2\), the explained variation divided by the total variation, tells us the amount of variability of the response that’s explained by its linear relationship with the grams of fat in cereal.

summary(mod)
## 
## Call:
## lm(formula = calories ~ fat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67.60 -29.96  -7.04  16.73 322.40 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  117.599      8.350  14.084  < 2e-16 ***
## fat           22.361      3.854   5.803 2.29e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.78 on 63 degrees of freedom
## Multiple R-squared:  0.3483, Adjusted R-squared:  0.338 
## F-statistic: 33.67 on 1 and 63 DF,  p-value: 2.292e-07

\(r^2\) is \(0.3483\), so \(34.83\%\) of the variability in number of calories is explained by the grams of fat in the cereal.
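The value of \(r^2\) can be recovered from the sums of squares described above, or as the squared correlation between the two variables. The sketch below shows both routes; `fit`, `sst`, `sse`, `ssr`, `r2_by_hand`, and `r2_from_cor` are illustrative names.

```r
library(MASS)  # provides the UScereal data set

fit <- lm(calories ~ fat, data = UScereal)

y <- UScereal$calories
sst <- sum((y - mean(y))^2)    # total variation
sse <- sum(residuals(fit)^2)   # unexplained variation
ssr <- sst - sse               # explained variation

r2_by_hand <- ssr / sst
r2_from_cor <- cor(UScereal$fat, UScereal$calories)^2
print(c(by_hand = r2_by_hand,
        from_cor = r2_from_cor,
        from_summary = summary(fit)$r.squared))
```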

Summary

Thus, by creating a linear model to look at the relationship between fat content and calories in cereal, we have found that they are related by the following regression equation: \[\hat{calories} = 117.60+22.36\times fat.\]

We found that there is evidence of a linear relationship between grams of fat and the number of calories in cereal through the F-test and T-test. Also, \(34.83\%\) of the variation in calories can be explained by the amount of fat in a cereal.

However, we need to be careful when using these results: even though the homoscedasticity assumption appeared to be met, the residuals did not appear to follow a normal distribution.